Robot audiovisual system

ABSTRACT

A robot visuoauditory system is disclosed that makes it possible to process data in real time to track an object both visually and auditorily, that can integrate visual and auditory information on an object to permit the object to be kept tracked without fail, and that makes it possible to visualize the real-time processing. In the system, the audition module (20), in response to sound signals from microphones, extracts pitches therefrom, separates their sound sources from each other and locates the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event (28) for each object speaker. The vision module (30), on the basis of an image taken by a camera, identifies by face, and locates, each such speaker, thereby extracting a visual event (39) therefor. The motor control module (40) for turning the robot horizontally extracts a motor event (49) from a rotary position of the motor. The association module (60) for controlling these modules forms from the auditory, visual and motor control events an auditory stream (65) and a visual stream (66) and then associates these streams with each other to form an association stream (67). The attention control module (64) effects attention control designed to make a plan of the course in which to control the drive motor, e.g., upon locating the sound source for the auditory event and locating the face for the visual event, thereby determining the direction in which each speaker lies. The system also includes a display (27, 37, 48, 68) for displaying at least a portion of the auditory, visual and motor information. The attention control module (64) servo-controls the robot on the basis of the association stream or streams.

TECHNICAL FIELD

[0001] The present invention relates to a visuoauditory system that is suitable for use with a robot, e.g., a robot of human type ("humanoid") or animal type ("animaloid"). The term "robot" used herein, however, is not intended to refer only to a humanoid or animaloid robot, but is used to encompass those for industrial applications as well and, generally, to refer to any active electronic machine that can be controlled automatically.

BACKGROUND ART

[0002] For active electronic machines such as robots of human and animal types, attention has in recent years been drawn to the active senses of vision and audition. A sense by a sensory device provided in a robot for its vision or audition is made active (active sensory perception) when a portion of the robot such as its head carrying the sensory device is varied in position or orientation, as controlled by a drive means in the robot, so that the sensory device follows the movement or instantaneous position of a target or object to be sensed or perceived.

[0003] As for active vision, studies have diversely been undertaken using an arrangement in which at least a camera as the sensory device holds its optical axis directed towards an object by being controlled in position by the drive means, while permitting itself to perform automatic focusing and zooming in and out relative to the object to take a picture thereof.

[0004] As for active audition or hearing, at least a microphone as the sensory device may likewise have its facing kept directed towards a target or object by being controlled in position by the drive mechanism to collect a sound from the object. Such active audition may refer to visual information to determine the direction in which the sound source lies, as disclosed by the present applicant in Japanese patent application No. 2000-22677 entitled "Robot Auditory System".

[0005] By the way, active vision and audition are closely related to a motor control module for changing the direction of the robot (in a horizontal plane). In order to make its active vision and audition work with respect to a specific object, it is necessary to direct the robot towards that object, i.e., to effect attention control.

[0006] Combining vision and audition with a motor control module in turn requires processing data in real time to keep track of an object for vision and audition. In conventional robot development efforts, however, while a real-time processing system has been developed for a single sound source object, no attempt has been made to develop an active auditory system in which data are processed in real time in a situation, e.g., in which people are talking to each other, to identify each individual person.

[0007] For a robot to precisely identify each individual speaker as a specific object on the basis of its environmental conditions requires visual and auditory data to be integrated. No active auditory system has been developed in which such data are processed in real time in a situation, e.g., in which people are talking to each other, to identify each individual person.

[0008] For vision and audition to be united with the control of a motor control module, not only is it necessary to process data in real time to keep track of an object for vision and audition, but it is also extremely useful to process data for the internal state in real time and to visualize it in such a visual and auditory tracking process. In conventional robot development efforts, however, while such a real-time processing system has been developed for a single sound source object, no attempt has been made to develop an active auditory system in which such data are processed in real time in a situation, e.g., in which people are talking to each other, to identify each individual person, nor has there been any attempt to make such visualization in real time.

[0009] Also, while the attention control for a drive motor in a motor module has so far been undertaken using either so-called vision servo or auditory servo, no system has been proposed whereby a robot is accurately controlled using visuoauditory servo, i.e., using both its vision and audition concurrently.

DISCLOSURE OF THE INVENTION

[0010] With the foregoing taken into account, it is a first object of the present invention to provide a robot visuoauditory system that makes it possible to process data in real time to track an object or target by vision and audition, and to integrate information from sensors for vision, audition and a motor in such a manner that if any portion of the information is missing or broken off, the remaining information may complement it to allow the object to be continuingly tracked.

[0011] A second object of the present invention is to provide a robot visuoauditory system that can integrate visual and auditory information on an object to permit the object to be kept tracked without fail.

[0012] A third object of the present invention is to provide a robot visuoauditory system that makes it possible to process the information in real time to keep tracking the object both visually and auditorily and, at the same time, to visualize the real-time processing.

[0013] A fourth object of the present invention is to provide a visuoauditory system for a robot whereby the robot is visuoauditorily servo-controlled using both its vision and audition, with an object kept tracked both visually and auditorily.

[0014] The first object mentioned above is attained in accordance with the present invention in a first aspect thereof by a visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from the said audition module, the said vision module and the said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by the said association module, whereby: the said audition module, in response to sound signals from the said microphones, is adapted to extract pitches therefrom, separate their sound sources from each other and locate the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event; the said vision module, on the basis of an image taken by the camera, is adapted to identify by face, and locate, each such speaker, thereby extracting a visual event therefor; the said motor control module, in response to a rotary position of the said drive motor, is adapted to extract a motor event therefrom; the said association module, on the basis of the auditory event from the said audition module, the visual event from the said vision module and the motor event from the said motor control module, is adapted to form an auditory stream and a visual stream and then to associate these streams with each other to form an association stream; and the said attention control module, in effecting the attention control, is adapted to make a plan for a course in which to control the said drive motor in the said motor control module in accordance with the said streams.

[0015] In this robot visuoauditory system according to the present invention, the said association module, when forming the said auditory and visual streams, is preferably adapted to synchronize the said auditory, visual and motor events, which are formed asynchronously with each other.

[0016] In this robot visuoauditory system according to the present invention, preferably the said association module is made a server and each of the said audition, vision and motor control modules is made a client connected to the said server.

[0017] In this robot visuoauditory system according to the present invention, the said server and each said client are interconnected via a LAN.

[0018] In this robot visuoauditory system according to the present invention, when the said attention control module effects the attention control, higher priority is preferably given in the order of the association stream, the auditory stream and the visual stream.

[0019] The system so configured as mentioned above allows the audition module to extract pitches from sounds collected by the microphones from external objects by using a harmonic structure thereof, to find and identify the direction in which each of the sound sources as individual speakers lies, thereby extracting an auditory event for each object speaker.

[0020] Also, the vision module, on the basis of an image taken by a camera, is allowed to identify by face, and locate, each such speaker, thereby extracting a visual event therefor.

[0021] Further, the motor control module is allowed to extract a motor event by detecting the direction in which the robot faces on the basis of a rotary position of the drive motor for turning the robot horizontally.

[0022] At this point it should be noted that the term "event" as used herein is intended to refer to the occurrence that, at a given point of time, a sound or face is detected or a feature such as a pitch and its direction is extracted to identify, recognize or discriminate the speaker or face, or to the state that the drive motor is in rotation. The term "stream" as used herein is intended to refer to events interconnected so as to continue with time while effecting error correction.

[0023] Here, the association module, on the basis of an auditory, a visual and a motor event thus extracted, respectively, forms an auditory and a visual stream for each speaker and, upon associating these streams with each other, forms an association stream. Then, the attention control module, on the basis of these streams, effects attention control, thereby making a plan of the course in which to control the drive motor in the motor control module.

[0024] The term "attention" as used herein is intended to refer to causing the robot to pay attention auditorily and/or visually to an object speaker. Likewise, the term "attention control" as used herein is intended to refer to the motor control module causing the robot to change its position and/or direction or orientation and thereby to pay attention to the object speaker.

[0025] Thus, the attention control module directs the robot to face towards an object speaker by controlling the drive motor in the motor control module on the basis of the plan made. Directing the robot to face the object speaker allows the audition module to collect through the microphones a voice of the object speaker from the front direction, in which the voice can be collected at a high signal-to-noise (S/N) ratio, and thereby to locate the object speaker accurately, while at the same time permitting the vision module to take through the camera a clear-cut image of that object speaker.

[0026] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be perceived by the attention control module on the basis of only the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled in real time.

[0027] Accordingly, making the audition module, the vision module and the motor control module cooperate with the association module and the attention control module allows the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived.

[0028] In the case where the association module, when forming an auditory and a visual stream, synchronizes the auditory, visual and motor events with each other where they are formed asynchronously with each other, the fact that the asynchronously formed auditory, visual and motor events are synchronized with each other allows the differing periods in which these events are formed in the association module and the differing delay times then entailed to be absorbed, and allows an auditory stream made of the auditory event and a visual stream made of the visual event, if they exist close to each other in distance, to be associated with each other to form a higher-order association stream.
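By way of illustration only, a minimal Python sketch of such time-based synchronization is given below. The buffer class, the time stamps in seconds and the matching tolerance are assumptions of this sketch, not features taken from the embodiment described herein.

    # Hypothetical sketch: buffer asynchronously formed events by time stamp
    # so that auditory, visual and motor events can later be aligned.
    import bisect

    class EventBuffer:
        def __init__(self):
            self.times = []    # sorted event time stamps (seconds)
            self.events = []   # events, kept aligned with self.times

        def push(self, t, event):
            i = bisect.bisect(self.times, t)
            self.times.insert(i, t)
            self.events.insert(i, event)

        def nearest(self, t, tolerance=0.1):
            # Return the buffered event closest in time to t, or None if
            # the best candidate lies outside the given tolerance.
            if not self.times:
                return None
            i = bisect.bisect(self.times, t)
            candidates = [j for j in (i - 1, i) if 0 <= j < len(self.times)]
            j = min(candidates, key=lambda k: abs(self.times[k] - t))
            return self.events[j] if abs(self.times[j] - t) <= tolerance else None

Buffering each module's events this way lets the differing formation periods and delays be absorbed: the association module simply asks each buffer for the event nearest a common reference time.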

[0029] In the case where the association module is made a server, each of the audition, vision and motor control modules is made a client connected to the server, and the server and each such client are interconnected via a LAN, the fact that the clients connected to the server via the LAN are subjected to distributed processing allows their events to be processed rapidly in real time.

[0030] Giving higher priority in the order of an association, an auditory and a visual stream when the attention control module effects attention control, namely having primary reference to the association stream and subsidiary reference to the auditory and visual streams, allows a particular speaker to be kept tracked: e.g., if the system fails to acquire a visual event of the speaker and thus fails to form a visual stream thereof, it may continue acquiring an auditory event to form an auditory stream thereof and effect attention control on the basis of the auditory stream so formed, as sketched below. Further, if more than one stream of the same type exists, priority may be given to the oldest or the newest, depending on the particular circumstances encountered.
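A minimal sketch of this priority rule, assuming hypothetical stream objects carrying a kind field ("association", "auditory" or "visual") and a creation time, might read:

    # Hypothetical sketch of the stream-priority rule: association streams
    # first, then auditory, then visual; ties broken by age.
    PRIORITY = {"association": 0, "auditory": 1, "visual": 2}

    def select_attention_stream(streams, prefer_oldest=True):
        # Pick the stream attention control should follow, or None.
        if not streams:
            return None
        return min(
            streams,
            key=lambda s: (PRIORITY[s.kind],
                           s.created_at if prefer_oldest else -s.created_at),
        )

Whether the oldest or the newest stream of a given type wins is left as a parameter, reflecting the circumstance-dependent choice mentioned above.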

[0031] In this way, dealing with events and streams, each symbolic, to represent auditory and visual information allows displaying in real time in the display, since the amount of data can be largely compressed compared with dealing with unsymbolized sound and image data.

[0032] The second object mentioned above is achieved in accordance with the present invention in a second aspect thereof by a visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from the said audition module, the said vision module and the said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by the said association module, whereby: the said audition module, in response to sound signals from the said microphones, is adapted to extract pitches therefrom, separate their sound sources from each other and locate the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event; the said vision module, on the basis of an image taken by the camera, is adapted to identify by face, and locate, each such speaker, and then to identify it as the sound source, thereby extracting a visual event therefor; the said motor control module, in response to a rotary position of the said drive motor, is adapted to extract a motor event therefrom; the said association module, on the basis of the said auditory event, the said visual event and the said motor event, is adapted to form an auditory stream and a visual stream upon determining a direction in which each such speaker lies on the basis of directional information derived from locating the sound source of the said auditory event and that from locating the face of the said visual event, and then to associate these streams with each other to form an association stream; and the said attention control module, in effecting the attention control, is adapted to make a plan for a course in which to control the said drive motor in the said motor control module in accordance with the said streams.

[0033] In this robot visuoauditory system according to the present invention, the said association module, when forming the said auditory and visual streams, is preferably adapted to synchronize the said auditory, visual and motor events, which are formed asynchronously with each other.

[0034] In this robot visuoauditory system according to the present invention, the said audition module is preferably adapted to identify each such speaker upon detecting a voice MFCC from the sound signals, and the said association module is preferably adapted to select the said auditory stream and the said visual stream to which the said auditory event and the said visual event are to be connected, by specifying each such speaker on the basis of speaker identification for the said auditory event and speaker identification for the said visual event.

[0035] In this robot visuoauditory system according to the present invention, with a plurality of streams coming close to one another, the said association module is preferably adapted to select the said auditory stream and the said visual stream to which the said auditory event and the said visual event are to be connected, by referring to temporal flows of the said auditory and visual events.

[0036] In this robot visuoauditory system according to the present invention, the said association module is preferably adapted to associate the said auditory stream and the said visual stream with each other to form the said association stream if they are strongly associated with each other, and to disassociate and extinguish the said association stream when the said auditory and visual streams which make up the same become weakly associated with each other.

[0037] According to the system makeup mentioned above, the audition module receives sounds collected by the microphones from external objects as sound sources, extracts pitches from the collected sounds utilizing their harmonic structures to find the directions in which the sound sources respectively exist and to identify the individual speakers as the respective object sound sources, and then extracts their own auditory events.

[0038] Also, the vision module receives images of the speakers taken by a camera and extracts from the images the respective visual events of the individual speakers upon identifying the faces of the speakers by pattern recognition and locating the speakers.

[0039] Further, the motor control module extracts a motor event upon detecting a direction in which the robot faces on the basis of a rotary position of the drive motor turning the robot horizontally.

[0040] At this point it should be noted that the term "event" as used herein is intended to refer to a collection of a series of data with time information, formed reflecting the occurrence that, at a given point of time, a sound or face is detected or a feature such as a pitch and its direction is extracted to identify, recognize or discriminate the speaker or face, or to a collection of positional data with time information for the drive motor at a given time instant or instants. The term "stream" as used herein is intended to refer to events interconnected so as to continue with time while effecting error correction. A data-structure sketch of these notions is given below.
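The following Python sketch is a hedged illustration of these definitions; the field names are assumptions made for illustration only, not terms of the disclosure.

    # Hypothetical sketch of the event and stream structures defined above.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class AuditoryEvent:
        t: float                          # time stamp (s)
        pitch_hz: float                   # extracted pitch
        direction_deg: float              # located sound-source direction
        belief: float                     # belief factor of the localization
        speaker_id: Optional[str] = None  # from MFCC comparison, if any

    @dataclass
    class VisualEvent:
        t: float
        face_id: str                      # result of face identification
        direction_deg: float              # located face direction
        confidence: float

    @dataclass
    class MotorEvent:
        t: float
        position_deg: float               # rotary position of the drive motor

    @dataclass
    class Stream:
        kind: str                         # "auditory", "visual" or "association"
        events: List[object] = field(default_factory=list)  # time-connected events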

[0041] Here, the association module, on the basis of the auditory event, the visual event and the motor event thus extracted, forms an auditory stream and a visual stream for each such speaker upon determining a direction in which each such speaker lies on the basis of directional information derived from locating the sound source of the auditory event and that from locating the face of the visual event, and then associates these streams with each other to form an association stream.

[0042] In this case, the association module determines the direction in which each such speaker lies on the basis of directional information derived from locating the sound source of the said auditory event and that from locating the face of the said visual event, and then forms an association stream with reference to the determined direction of each such speaker.

[0043] And the attention control module effects the attention control on the basis of these streams, whereby a plan for the course in which to control the said drive motor in the said motor control module is made.

[0044] The term "attention" as used herein is intended to refer to causing the robot to pay attention auditorily and/or visually to an object speaker. Likewise, the term "attention control" as used herein is intended to refer to the motor control module causing the robot to change its position and/or direction or orientation and thereby to pay attention to the object speaker.

[0045] And then, the attention control module, by controlling the drive motor in the motor control module on the basis of a result of the plan, directs the robot to face the object speaker. Directing the robot to face the object speaker allows the audition module to collect through the microphones a voice of the object speaker from the front direction, in which the voice can be collected at the highest sensitivity, and thereby to locate the object speaker accurately, while at the same time permitting the vision module to take through the camera a clear-cut image of that object speaker.

[0046] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be perceived through attention control on the basis of the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled with the direction of the object speaker seized accurately.

[0047] Accordingly, the audition module, the vision module and the motor control module are made to cooperate with the association module and the attention control module, allowing the direction of each such speaker to be determined on the basis of the directional information from locating the sound source from the auditory event and locating the speaker from the visual event, and thus allowing the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived without fail.

[0048] In the case where the association module, when forming an auditory and a visual stream, synchronizes the auditory, visual and motor events with each other where they are formed asynchronously with each other, the fact that the asynchronously formed auditory, visual and motor events are synchronized with each other allows the differing periods in which these events are formed in the association module and the differing delay times then entailed to be absorbed, and allows an auditory stream made of the auditory event and a visual stream made of the visual event, if they exist close to each other in distance, to be associated with each other to form a higher-order association stream.

[0049] If the audition module identifies each such speaker upon detecting a voice MFCC (Mel Frequency Cepstrum Coefficient) from the sound signals, and the association module selects the auditory stream and the visual stream to which the auditory event and the visual event are to be connected by specifying each such speaker on the basis of speaker identification for the auditory event and speaker identification for the visual event, then, since each such speaker can be identified by the MFCC of a voice from its auditory event, the individual speakers can be identified by their own auditory and visual events, respectively. Thus, in forming auditory and visual streams of a plurality of speakers, an auditory and a visual event of a speaker can be connected to an auditory and a visual stream of the same speaker to specify each speaker accurately. Also, even if one of an auditory and a visual event of a speaker is broken off midway, an attempt to identify the speaker can be continued with the other event. Thus, if voices from a plurality of speakers are detected from the same direction, this higher-order integration of audition and vision allows the speakers to be separated from one another accurately and individually tracked upon identifying the speakers.
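As an illustrative sketch only, MFCC-based speaker identification of the kind described above could be prototyped with the librosa library; the registry format and the distance threshold below are assumptions, not part of the disclosed system:

    # Hypothetical sketch: compare a voice segment's mean MFCC vector with
    # pre-registered speaker signatures.
    import numpy as np
    import librosa

    def mfcc_signature(signal, sr=48000, n_mfcc=13):
        # Mean MFCC vector of one voice segment.
        m = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return m.mean(axis=1)

    def identify_speaker(signal, registry, sr=48000, threshold=50.0):
        # registry maps speaker name -> pre-registered signature vector.
        sig = mfcc_signature(signal, sr)
        name, dist = min(
            ((n, np.linalg.norm(sig - ref)) for n, ref in registry.items()),
            key=lambda p: p[1],
        )
        return name if dist < threshold else None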

[0050] If, with a plurality of streams coming close to one another, the association module selects the auditory stream and the visual stream to which the auditory event and the visual event are to be connected by referring to temporal flows of the auditory and visual events, it is possible, even in case a plurality of speakers come close to one another so as to cause the auditory and visual streams of these speakers to cross one another, to keep the auditory and visual streams of any particular speaker accurately upon predicting its movement. Thus, the ambiguities which these auditory and visual streams individually possess are allowed to complement each other, raising the so-called robustness of the system while permitting a plurality of speakers to be tracked individually without fail.

[0051] In the case where the association module is adapted to associate an auditory stream and a visual stream with each other to form an association stream if they are strongly associated with each other, and to disassociate and extinguish the association stream when the auditory and visual streams which make up the same become weakly associated with each other, it is possible to form an association stream accurately for each speaker, thus circumventing the ambiguities which the auditory and visual streams individually possess, thereby permitting each speaker to be discriminated from another accurately.
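One hypothetical way to realize such a strong/weak association rule is a hysteresis on the angular gap between the auditory and visual stream directions; the thresholds below are assumptions for illustration, not values from the disclosure:

    # Hypothetical sketch: form the association stream when the auditory and
    # visual directions agree closely; extinguish it only once they have
    # clearly drifted apart (hysteresis avoids rapid flicker).
    def update_association(aud_dir_deg, vis_dir_deg, associated,
                           form_deg=10.0, break_deg=20.0):
        gap = abs(aud_dir_deg - vis_dir_deg)
        if not associated:
            return gap <= form_deg    # strongly associated -> form
        return gap <= break_deg       # weakly associated -> extinguish

Using a wider threshold for breaking than for forming keeps the association stream stable against momentary localization noise.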

[0052] Also, selecting the angle suitably allows a speaker, while moving, to be captured without fail, in a manner of predicting a movement of the speaker in order to specify it.

[0053] In this way, dealing with events and streams, each symbolic, to represent auditory and visual information allows displaying in real time in the display, since the amount of data can be largely compressed compared with dealing with unsymbolized sound and image data.

[0054] The third object mentioned above is achieved in accordance with the present invention in a third aspect thereof by a visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from the said audition module, the said vision module and the said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by the said association module, whereby: the said audition module, in response to sound signals from the said microphones, is adapted to extract pitches therefrom, separate their sound sources from each other and locate the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event; the said vision module, on the basis of an image taken by the camera, is adapted to identify by face, and locate, each such speaker, thereby extracting a visual event therefor; the said motor control module, in response to changes in a rotary position of the said drive motor, is adapted to extract a motor event therefrom; the said association module, on the basis of the auditory event from the said audition module, the visual event from the said vision module and the motor event from the said motor control module, is adapted to form an auditory stream and a visual stream and then to associate these streams with each other to form an association stream; and the said attention control module, in effecting the attention control, is adapted to make a plan for a course in which to control the said drive motor in the said motor control module in accordance with the said streams, wherein the said system further comprises a display for displaying at least a portion of auditory information for the said audition module, visual information for the said vision module, motor information for the said motor control module and stream information for the said association module.

[0055] In this robot visuoauditory system according to the present invention, the said display preferably includes an auditory display for displaying, as the auditory information, a spectrum of sound signals from sound sources and peaks extracted therefrom, and the said auditory event.

[0056] In this robot visuoauditory system according to the present invention, the said auditory display is preferably adapted to display the said auditory event in the form of a circle with its center representing the robot, its axis of ordinates representing relative angles of the direction that the robot may face, its axis of abscissas representing pitches and its diameter representing a belief factor.

[0057] In this robot visuoauditory system according to the present invention, the said display preferably includes a visual display for displaying, as the said visual information, an image of an extracted face taken by the camera and shown within a frame, and the said visual event.

[0058] In this robot visuoauditory system according to the present invention, the said visual display is preferably adapted to display the said visual event in the form of a list of face identifications and face locations extracted with degrees of firmness.

[0059] In this robot visuoauditory system according to the present invention, the said display preferably includes a motor display for displaying, as the said motor information, the direction in which and the speed at which the robot moves, three-dimensionally in real time.

[0060] In this robot visuoauditory system according to the present invention, the said display preferably includes a stream display for displaying, as the stream information, a stream chart and a radar chart.

[0061] In this robot visuoauditory system according to the present invention, the said stream display is preferably adapted to display, in the said stream chart, the said auditory, visual and association streams constituting the said stream information.

[0062] In this robot visuoauditory system according to the present invention, the said stream display is preferably adapted to display, in the said radar chart, an instantaneous status of the said streams together with the visual field of the camera and the located sound sources.

[0063] The system so configured as mentioned above allows the audition module to extract pitches from sounds collected by the microphones from external objects as sound sources by using a harmonic structure thereof, to find and identify the direction in which each of the sound sources as individual speakers lies, thereby extracting an auditory event for each object speaker.

[0064] Also, the vision module, on the basis of an image taken by a camera, is allowed to identify by face, and locate, each such speaker, thereby extracting a visual event therefor.

[0065] Further, the motor control module is allowed to extract a motor event by detecting the direction in which the robot faces on the basis of the rotary position of the drive motor for turning the robot horizontally.

[0066] At this point it should be noted that the term "event" as used herein is intended to refer to the occurrence that, at a given point of time, a sound or face is detected or a feature such as a pitch and its direction is extracted to identify, recognize or discriminate the speaker or face, or to the state that the drive motor is in rotation. The term "stream" as used herein is intended to refer to events interconnected so as to continue with time while effecting error correction.

[0067] Here, the association module, on the basis of an auditory, a visual and a motor event thus extracted, respectively, forms an auditory and a visual stream for each speaker and, upon associating these streams with each other, forms an association stream. Then, the attention control module, on the basis of these streams, effects attention control, thereby making a plan of the course in which to control the drive motor in the motor control module.

[0068] The term "attention" as used herein is intended to refer to causing the robot to pay attention auditorily and/or visually to an object speaker. Likewise, the term "attention control" as used herein is intended to refer to the motor control module causing the robot to change its position and/or direction or orientation and thereby to pay attention to the object speaker.

[0069] Thus, the attention control module directs the robot to face towards an object speaker by controlling the drive motor in the motor control module on the basis of the plan made. Directing the robot to face the object speaker allows the audition module to collect through the microphones a voice of the object speaker from the front direction, in which the voice can be collected at a high S/N ratio, and thereby to locate the object speaker accurately, while at the same time permitting the vision module to take through the camera a clear-cut image of that object speaker.

[0070] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be perceived by the attention control module on the basis of only the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled in real time.

[0071] Accordingly, making the audition module, the vision module and the motor control module cooperate with the association module and the attention control module allows the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived.

[0072] Further, displays are included to display at least a portion of the auditory information from the audition module, the visual information from the vision module and the motor information from the motor control module, and to visualize the real-time processing by the association module, thereby permitting the status of the real-time processing to be perceived at a glance.

[0073] If the display includes an auditory display for displaying, as the auditory information, a spectrum of sound signals from sound sources and peaks extracted therefrom, and the auditory event, and if the auditory display is adapted to display the auditory event in the form of a circle with its center representing the robot, its axis of ordinates representing relative angles of the direction that the robot may face, its axis of abscissas representing pitches and its diameter representing a belief factor, the auditory information can be seized instantly by seeing the auditory display.

[0074] If the display includes a visual display for displaying, as the visual information, an image of an extracted face taken by the camera and shown within a frame, and the visual event, and if the visual display is adapted to display the visual event in the form of a list of face identifications and face locations extracted with degrees of firmness, the visual information can be seized instantly by seeing the visual display.

[0075] If the display includes a motor display for displaying, as the motor information, the direction in which and the speed at which the robot moves, three-dimensionally in real time, the motor information can be seized instantly by seeing the motor display.

[0076] If the display includes a stream display for displaying, as the stream information, a stream chart and a radar chart, if the stream display is adapted to display, in the stream chart, the auditory, visual and association streams constituting the stream information, and if the stream display is adapted to display, in the radar chart, an instantaneous status of the streams together with the visual field of the camera and the located sound sources, an instantaneous status of the streams and changes in their states can be seized instantly by seeing the stream display.

[0077] In this way, dealing with events and streams, each symbolic, to represent auditory and visual information allows displaying in real time in the display, since the amount of data can be largely compressed compared with dealing with unsymbolized sound and image data.

[0078] The fourth object mentioned above is achieved in accordance with the present invention in a fourth aspect thereof by a visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from the said audition module, the said vision module and the said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by the association module, whereby: the said audition module, in response to sound signals from the microphones, is adapted to extract pitches therefrom, separate their sound sources from each other and locate the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event; the vision module, on the basis of an image taken by the camera, is adapted to identify by face, and locate, each such speaker, thereby extracting a visual event therefor; the motor control module, in response to a rotary position of the drive motor, is adapted to extract a motor event therefrom; the association module, on the basis of the auditory event from the audition module, the visual event from the vision module and the motor event from the motor control module, is adapted to form an auditory stream and a visual stream and then to associate these streams with each other to form an association stream; and the attention control module, in effecting the attention control, is adapted to make a plan for a course in which to control the drive motor in the motor control module in accordance with the streams, thereby effecting a visuoauditory servo for the robot.

[0079] In this robot visuoauditory system according to the present invention, when the attention control module effects the attention control, higher priority is preferably given in the order of the association stream, the auditory stream and the visual stream.

[0080] In this robot visuoauditory system according to the present invention, such priority may be altered according to circumstances.

[0081] In this robot visuoauditory system according to the present invention, the said attention control module, where more than one auditory or visual stream exists, is preferably adapted to select from them one auditory or visual stream according to circumstances and to form an association stream, thereby effecting the attention control on the basis of such an auditory, visual or association stream.

[0082] The system so configured as mentioned above allows the audition module to extract pitches from sounds collected by the microphones from external objects as sound sources by using a harmonic structure thereof, to find and identify the direction in which each of the sound sources as individual speakers lies, thereby extracting an auditory event for each object speaker.

[0083] Also, the vision module, on the basis of an image taken by a camera, is allowed to identify by face, and locate, each such speaker, thereby extracting a visual event therefor.

[0084] Further, the motor control module is allowed to extract a motor event by detecting the direction in which the robot faces on the basis of the rotary position of the drive motor for turning the robot horizontally.

[0085] At this point it should be noted that the term "event" as used herein is intended to refer to the occurrence that, at a given point of time, a sound or face is detected or a feature such as a pitch and its direction is extracted to identify, recognize or discriminate the speaker or face, or to the state that the drive motor is in rotation. The term "stream" as used herein is intended to refer to events interconnected so as to continue with time while effecting error correction.

[0086] Here, the association module, on the basis of an auditory, a visual and a motor event thus extracted, respectively, forms an auditory and a visual stream for each speaker and, upon associating these streams with each other, forms an association stream. Then, the attention control module, on the basis of the association stream, effects attention control, thereby making a plan of the course in which to control the drive motor in the motor control module.

[0087] The term "attention" as used herein is intended to refer to causing the robot to pay attention auditorily and/or visually to an object speaker. Likewise, the term "attention control" as used herein is intended to refer to the motor control module causing the robot to change its position and/or direction or orientation and thereby to pay attention to the object speaker.

[0088] Thus, the attention control module directs the robot to face towards an object speaker by visuoauditorily servo-controlling the drive motor in the motor control module on the basis of the plan made. Directing the robot to face the object speaker allows the audition module to collect through the microphones a voice of the object speaker from the front direction, in which the voice can be collected at a high sensitivity, and thereby to locate the object speaker accurately, while at the same time permitting the vision module to take through the camera a clear-cut image of that object speaker.

[0089] Here, with the attention control module designed to effect attention control on the basis of association streams using both auditory and visual information, and thus to servo-control the robot visuoauditorily upon recognizing the fact that a voice and a face (image) are derived from an identical object, it is possible to servo-control the robot with an accuracy higher than in a conventional auditory or visual servo system relying on either auditory or visual information only.

[0090] Thus, the visuoauditory servo-control system of the invention, in which the audition, vision and motor control modules are made to cooperate with the association and attention control modules, allows the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived.

[0091] A robot visuoauditory system according to the present invention can be incorporated into humanoid robots and various animaloid robots, as well as a variety of plantoid (plant-type) robots. In addition, it can be incorporated widely into active electronic machines which are automatically controlled, such as a robot having not a figurative but an abstract form.

[0092] In this way, dealing with events and streams, each symbolic, to represent auditory and visual information allows displaying in real time in the display, since the amount of data can be largely compressed compared with dealing with unsymbolized sound and image data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0093] The present invention will better be understood from the following detailed description and the drawings attached hereto showing certain illustrative embodiments of the present invention. In this connection, it should be noted that the forms of embodiment illustrated in the accompanying drawings hereof are intended in no way to limit the present invention but to facilitate an explanation and understanding thereof. In the drawings:

[0094] FIG. 1 is a front elevational view illustrating the appearance of a humanoid robot incorporating a robot visuoauditory system that represents one form of embodiment of the present invention;

[0095] FIG. 2 is a side elevational view of the humanoid robot shown in FIG. 1;

[0096] FIG. 3 is an enlarged view diagrammatically illustrating the makeup of the head portion of the humanoid robot shown in FIG. 1;

[0097] FIG. 4 is a block diagram illustrating the electrical makeup of a robot visuoauditory system for the humanoid robot shown in FIG. 1;

[0098] FIG. 5 is an enlarged block diagram illustrating the electrical makeup of an audition module shown at Block 1 in FIG. 4;

[0099] FIG. 6 is an enlarged block diagram illustrating the electrical makeup of a vision module shown at Block 2 in FIG. 4;

[0100] FIG. 7 is an enlarged block diagram illustrating the electrical makeup of a motor control module shown at Block 3 in FIG. 4;

[0101] FIG. 8 is an enlarged block diagram illustrating the electrical makeup of an interactive module shown at Block 4 in FIG. 4;

[0102] FIG. 9 is an enlarged block diagram illustrating the electrical makeup of an association module shown at Block 5 in FIG. 4;

[0103] FIG. 10 is a diagram illustrating an example of the operation of a robot working as a party receptionist with the robot visuoauditory system of FIG. 4;

[0104] FIG. 11 is a diagram illustrating examples of images which viewers for the audition and vision modules display on their screens (A) and (B) in the robot visuoauditory system of FIG. 4, respectively;

[0105] FIG. 12 is a diagram illustrating examples of images which viewers for the motor control and association modules display on their screens (A) and (B) in the robot visuoauditory system of FIG. 4, respectively;

[0106] FIG. 13 is a diagram illustrating the peak extraction, the sound source location and the sound source separation by the audition module in the robot visuoauditory system of FIG. 4;

[0107] FIG. 14 is a diagram illustrating the process of forming streams by the association module in the robot visuoauditory system of FIG. 4;

[0108] FIG. 15 is a diagram illustrating an example of the operation of the robot working as a companion robot with the robot visuoauditory system of FIG. 4;

[0109] FIG. 16 is a diagram illustrating examples of what a radar chart (A), a stream chart (B) and camera images (C) may exhibit, respectively, at various time instants when the companion robot operates with the robot visuoauditory system of FIG. 4; and

[0110] FIG. 17 is a graphic diagram illustrating how the robot is directed (A) and how the visual and auditory streams vary (B) and (C) in the example of operation shown in FIG. 16.

BEST MODES FOR CARRYING OUT THE INVENTION

[0111] Hereinafter, an explanation is given in detail in respect of suitable forms of embodiment of the present invention with reference to the drawing figures.

[0112] FIGS. 1 to 4 in combination show the overall makeup of an experimental human-type robot or humanoid incorporating a robot visuoauditory system according to the present invention as one form of embodiment thereof.

[0113] In FIG. 1, the humanoid indicated by reference character 10 is shown made up as a robot with four degrees of freedom (4 DOFs) and including a base 11, a body portion 12 supported on the base 11 so as to be rotatable uniaxially about a vertical axis, and a head portion 13 supported on the body portion 12 so as to be capable of swinging triaxially (about a vertical axis, a lateral horizontal axis extending from right to left or vice versa, and a longitudinal horizontal axis extending from front to rear or vice versa).

[0114] The base 11 may either be disposed fixed in position or arranged to be operable as a foot of the robot. Alternatively, the base 11 may be mounted on a movable carriage or the like.

[0115] The body portion 12 is supported rotatably relative to the base 11 so as to turn about the vertical axis as indicated by the arrow A in FIG. 1. It is rotationally driven by a drive means not shown and is covered with a sound insulating cladding as illustrated.

[0116] The head portion 13 is supported from the body portion 12 by means of a connecting member 13a and is made capable of swinging relative to the connecting member 13a about the longitudinal horizontal axis, as indicated by the arrow B in FIG. 1, and also about the lateral horizontal axis, as indicated by the arrow C in FIG. 2. And, as carried by the connecting member 13a, it is further made capable of swinging relative to the body portion 12, as indicated by the arrow D in FIG. 1, about another longitudinal horizontal axis extending from front to rear or vice versa. Each of these rotational swinging motions A, B, C and D of the head portion 13 is effected using a respective drive mechanism not shown.

[0117] Here, the head portion 13, as shown in FIG. 3, is covered over its entire surface with a sound insulating cladding 14 and, at the same time, is provided at its front side with a camera 15 as the vision means in charge of the robot's vision and at both its sides with a pair of microphones 16 (16a and 16b) as the auditory means in charge of the robot's audition or hearing.

[0118] The cladding 14 is composed of a sound absorbing synthetic resin such as, for example, urethane resin and, with the inside of the head portion 13 covered virtually to the full, is designed to insulate and shield sounds within the head portion 13. Likewise, the body portion 12 is covered with a cladding made of such a sound absorbing synthetic resin.

[0119] The camera 15 may be of a known design; thus any commercially available camera having three DOFs (degrees of freedom), namely the so-called panning, tilting and zooming functions, is applicable here.

[0120] The microphones 16 are attached to the head portion 13 so that, at its side faces, they have their directivity oriented towards its front. Here, the right and left hand side microphones 16a and 16b of the microphone set 16, as will be apparent from FIGS. 1 and 2, are mounted inside of, and thereby received in, stepped bulge protuberances 14a and 14b, respectively, of the cladding 14, with their stepped faces having one or more openings and facing to the front at both sides, and are thus arranged to collect through these openings a sound arriving from the front. At the same time, they are suitably insulated from sounds interior of the cladding 14 so as not to pick up such sounds to the extent possible. This makes up the microphones 16a and 16b as what is called a binaural microphone. It should be noted further that the cladding 14, in the areas where the outer microphones 16a and 16b are mounted, may be shaped so as to resemble human outer ears.

[0121] FIG. 4 shows the electrical makeup of a visuoauditory system including the microphones 16 and the camera 15. Referring to FIG. 4, the visuoauditory system indicated by reference character 17 is here configured for a robot serving as a party receptionist and companion and is made up of an audition module 20, a vision module 30, a motor control module 40, an interactive module 50 and an association module 60.

[0122] Referring next to FIGS. 5 to 9, which show these modules as enlarged, a further explanation of the robot visuoauditory system is given. For the sake of convenience of description, the audition module 20, the vision module 30, the motor control module 40, the interactive module 50 and the association module 60 are shown as enlarged, and indicated by Blocks 1, 2, 3, 4 and 5, in FIGS. 5, 6, 7, 8 and 9, respectively.

[0123] Here, the association module 60 (Block 5 in FIG. 9) is made up of a server of a client-server system, while the other modules, namely the audition module 20 (Block 1 in FIG. 5), the vision module 30 (Block 2 in FIG. 6), the motor control module 40 (Block 3 in FIG. 7) and the interactive module 50 (Block 4 in FIG. 8), are made up of respective clients in the system so as to operate asynchronously with each other. The server and the clients are each made up of, e.g., a personal computer, and are LAN-connected with each other via a network such as 100Base-T, communicating according to, e.g., the TCP/IP protocol.

[0124] Further, each of the modules 20, 30, 40, 50 and 60 is distributed hierarchically into, specifically from bottom to top, a device layer, a processing layer, a characterizing layer and an event layer.

[0125] As shown in FIG. 5, the audition module 20 comprises the microphone 16 as the device layer; a peak extractor 21, a sound source locator 22, a sound source separator 23 and a speaker specifier 23a as the processing layer; a pitch 24 and a horizontal direction 25 as the characterizing layer (data); and an auditory event former 26 and a viewer 27 as the event layer.

[0126] Here, the audition module 20 acts as shown in FIG. 13. Referring to FIG. 13, the audition module 20, as indicated by reference lettering X1, receives a sound signal picked up by the microphones 16 and sampled, e.g., at 48 kHz with 16 bits, and, as indicated by reference lettering X2, performs its frequency analysis by FFT (Fast Fourier Transformation) to form a spectrum for each of the left and right channels, as indicated by reference lettering X3. Then, the audition module 20 at its peak extractor 21 extracts a series of peaks for each of the left and right channels, whereby identical or resembling peaks between the left and right channels are paired.

[0127] Here, the peak extraction is carried out using a band-pass filter that transmits only the data for those local peaks which are higher in power than a threshold value and lie within a frequency band, e.g., between 90 Hz and 3 kHz, so as to exclude low-frequency noises and the high-frequency band. The threshold value here is defined by measuring the background noises and adding thereto a sensitivity parameter of, e.g., 10 dB.
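A minimal Python sketch of this peak extraction, assuming a windowed 48 kHz frame and the 90 Hz to 3 kHz band and noise-plus-10 dB threshold given above, might read (the noise_floor_db value stands in for the automatically measured background-noise level):

    # Sketch of the band-limited, thresholded local-peak extraction.
    import numpy as np

    def extract_peaks(frame, sr=48000, noise_floor_db=-60.0, margin_db=10.0,
                      fmin=90.0, fmax=3000.0):
        # Return (frequency, power_dB) for local spectral peaks above the
        # measured noise floor plus the sensitivity margin.
        spec = np.fft.rfft(frame * np.hanning(len(frame)))
        power_db = 20 * np.log10(np.abs(spec) + 1e-12)
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        threshold = noise_floor_db + margin_db
        peaks = []
        for i in range(1, len(power_db) - 1):
            if (fmin <= freqs[i] <= fmax and power_db[i] > threshold
                    and power_db[i - 1] < power_db[i] > power_db[i + 1]):
                peaks.append((freqs[i], power_db[i]))
        return peaks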

[0128] And the audition module 20, utilizing the fact that each peak has a harmonic structure associated therewith, extracts local peaks having their harmonic structures in turn, starting from the local peak whose frequency is the lowest. Then, as indicated by reference lettering X4, the sound source separator 23 applies an IFFT (Inverse Fast Fourier Transformation) to each set of the extracted peaks as one sound, to separate from the mixed sound of the various sound sources a sound signal for each of the sound sources, as indicated by reference lettering X5. Whereupon the audition module 20 at its sound source locator 22, as indicated by reference lettering X6, selects those sound signals from the left and right channels which are of an identical frequency, to find both an IPD (Interaural Phase Difference) and an IID (Interaural Intensity Difference), e.g., at intervals of five degrees.
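The following fragment sketches, under stated assumptions, the harmonic grouping, IFFT resynthesis and per-bin interaural differences just described; the 6% harmonic tolerance is an assumption of the sketch, not a value from the disclosure:

    # Sketch: keep spectral bins near integer multiples of f0, resynthesize
    # one source by IFFT, and compute interaural differences per bin.
    import numpy as np

    def separate_source(spec, freqs, f0, tolerance=0.06):
        mask = np.zeros_like(spec)
        for k in range(1, int(freqs[-1] // f0) + 1):
            near = np.abs(freqs - k * f0) <= tolerance * k * f0
            mask[near] = spec[near]
        return np.fft.irfft(mask)   # time-domain signal of one source

    def ipd_iid(left_spec, right_spec, bin_idx):
        # Interaural phase and intensity differences at one frequency bin.
        ipd = np.angle(left_spec[bin_idx]) - np.angle(right_spec[bin_idx])
        iid = 20 * np.log10(np.abs(left_spec[bin_idx])
                            / (np.abs(right_spec[bin_idx]) + 1e-12))
        return ipd, iid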

[0129] Then, the audition module 20 at its sound source locator 22, utilizing the so-called auditory epipolar geometry to make IPD hypotheses P_h by hypothetical inference in a range of ±90°, with the front of the robot 10 assumed to be 0°, computes a distance d(θ) between the separated sound and each hypothesis by way of the equation (1) given below:

$$d(\theta) = \frac{1}{n_{f<1.5\,\mathrm{kHz}}} \sum_{f=F_0}^{1.5\,\mathrm{kHz}} \frac{\bigl(P_h(\theta, f) - P_s(f)\bigr)^2}{f} \qquad (1)$$

[0130] where n_{f<1.5 kHz} is the number of harmonics whose frequencies are less than 1.5 kHz, P_h(θ, f) is the hypothesized IPD and P_s(f) is the IPD of the separated sound. The 1.5 kHz limit was so set in this experiment in view of the fact that the IPD between the left and right hand side microphones across the base line was found effective for frequencies less than 1.2 to 1.5 kHz.
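Equation (1) transcribes directly into code; P_h and P_s below stand for the hypothesized and measured IPD functions named in the equation:

    # Direct transcription of equation (1).
    def ipd_distance(theta, harmonics, P_h, P_s, f_limit=1500.0):
        # Mean frequency-weighted squared IPD error over harmonics < 1.5 kHz.
        fs = [f for f in harmonics if f < f_limit]
        if not fs:
            return float("inf")
        return sum((P_h(theta, f) - P_s(f)) ** 2 / f for f in fs) / len(fs)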

[0131] The IID, like the IPD, can be found as a difference in power between the left and right hand side microphones for each harmonic of the separated sound. As for the IID, however, which of the left and right hand sides the sound source is found on is determined not using hypothetical inference but using the discriminant function (2) given below:

$$I = \sum_{f=1.5\,\mathrm{kHz}}^{3\,\mathrm{kHz}} I_s(f) \qquad (2)$$

[0132] That is, with the IID of each harmonic of frequency f expressed by $I_s(f)$: if I is positive, the sound source is determined to lie on the left hand side of the robot; if I is negative, on the right hand side; and if I is zero, in front of the robot.
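
The discriminant of equation (2) and the sign rule above reduce to a few lines; a minimal sketch, assuming the per-harmonic IIDs are given as (frequency, IID) pairs:

    def iid_side(iid_per_harmonic):
        """Discriminant function (2): sum the IIDs Is(f) of the harmonics
        between 1.5 kHz and 3 kHz and read the sign, per paragraph [0132]."""
        I = sum(iid for f, iid in iid_per_harmonic if 1500.0 <= f <= 3000.0)
        if I > 0:
            return "left"
        if I < 0:
            return "right"
        return "front"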

[0133] Hypothesis generation for the IID, which would require an enormous amount of computation to take the shape of the head of the robot into account, is impractical for real-time processing and is therefore not adopted here, unlike for the IPD.

[0134] Then, the matching of the measured IPD and IID against the hypotheses is effected as indicated by reference lettering X8.

[0135] Thereafter, the sound source locator 22 in the audition module 20 computes the belief factor $BF_{\mathrm{IPD}}(\theta)$ of the IPD from the distance d(θ) by applying thereto the probability density function defined below:

$BF_{\mathrm{IPD}}(\theta) = \int_{-\infty}^{\frac{d(\theta)-m}{\sqrt{s/n}}} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)\, dx$  (3)

[0136] where m and s are the mean and variance of d(θ), respectively, and n is the number of distances d(θ).
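
Equation (3) is the standard normal cumulative distribution evaluated at (d(θ) − m)/√(s/n), which can be computed in closed form via the error function; a minimal sketch:

    from math import erf, sqrt

    def bf_ipd(d_theta, m, s, n):
        """Belief factor of equation (3): the standard normal CDF at
        (d(theta) - m) / sqrt(s / n), with m the mean and s the variance
        of d per paragraph [0136]."""
        z = (d_theta - m) / sqrt(s / n)
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))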

[0137] Also, the belief factor $BF_{\mathrm{IID}}(\theta)$ of the IID is set to 0.35 and 0.65 for 30°<θ≦90°, according as I is positive or negative, respectively; to 0.5 and 0.5 for −30°<θ≦30°; and to 0.65 and 0.35 for −90°≦θ<−30°, again according as I is positive or negative, respectively.

[0138] The IPD's belief factor $BF_{\mathrm{IPD}}(\theta)$ and the IID's belief factor $BF_{\mathrm{IID}}(\theta)$ derived in this way are combined by equation (4) given below, according to the Dempster-Shafer theory, into a new firmness of belief $BF_{\mathrm{IPD+IID}}(\theta)$:

$BF_{\mathrm{IPD+IID}}(\theta) = BF_{\mathrm{IPD}}(\theta)\,BF_{\mathrm{IID}}(\theta) + \left(1 - BF_{\mathrm{IPD}}(\theta)\right)BF_{\mathrm{IID}}(\theta) + BF_{\mathrm{IPD}}(\theta)\left(1 - BF_{\mathrm{IID}}(\theta)\right)$  (4)
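
The piecewise lookup of paragraph [0137] and the combination of equation (4) may be sketched together as follows (the function names are illustrative assumptions):

    def bf_iid(theta_deg, I):
        """IID belief factor per the piecewise table of paragraph [0137]."""
        if 30.0 < theta_deg <= 90.0:
            return 0.35 if I > 0 else 0.65
        if -30.0 < theta_deg <= 30.0:
            return 0.5
        return 0.65 if I > 0 else 0.35  # -90 <= theta < -30

    def bf_combined(bf_ipd_theta, bf_iid_theta):
        """Dempster-Shafer combination of equation (4)."""
        p, q = bf_ipd_theta, bf_iid_theta
        return p * q + (1 - p) * q + p * (1 - q)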

[0139] Thus, the audition module 20 at its auditory event former 26 makes an auditory event 28 from a list of the top 20 values of the firmness of belief $BF_{\mathrm{IPD+IID}}(\theta)$, ranked in descending order of the likelihood of indicating the sound source, together with their directions (θ) and the pitches.

[0140] On the other hand, the speaker specifier 23 a may find the MFCC (Mel Frequency Cepstrum Coefficients) of a speaker from the sound signals from the microphone 16 and compare the same with those of pre-registered speakers to identify who the speaker is.
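
A deliberately naive sketch of such an MFCC comparison, assuming the librosa library for feature extraction and a nearest-mean-vector match (the matching rule and data layout are assumptions of the sketch, not details disclosed in the text):

    import numpy as np
    import librosa  # assumed here for MFCC extraction; the text names no library

    def identify_speaker(wave, sr, registered):
        """Compare the mean MFCC vector of the input signal against
        pre-registered speakers (a dict of name -> mean MFCC vector) and
        return the closest one."""
        mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13).mean(axis=1)
        names = list(registered)
        dists = [np.linalg.norm(mfcc - registered[n]) for n in names]
        return names[int(np.argmin(dists))]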

[0141] Thus, the audition module 20 at its auditory event former 26 makes an auditory event 28 from the list of the top 20 values of the firmness of belief $BF_{\mathrm{IPD+IID}}(\theta)$ higher in the likelihood of indicating the sound source, their directions (θ) and the pitches, as well as the identification of the speaker.

[0142] In this way, the audition module 20 is designed to identify at least one speaker (speaker identification) and to extract an auditory event thereof, either by extracting pitches from the sound signals and separating and locating a sound source, or by doing so together with the MFCC comparison, and then to transmit the extracted auditory event via a network 70 to the association module 60. The abovementioned process in the audition module 20 is carried out at intervals of 40 milliseconds.

[0143] The viewer 27 is included to display on a client screen the auditory event formed as described above. Specifically, as shown in FIG. 11(A), on a left hand side window 27 a of the screen may be displayed, e.g., with a black curve 27 b, the power spectrum of the auditory event 28 being measured; its peaks may be displayed, e.g., with blue vertical lines 27 c, and the levels of the automatically measured background noise with a red curve 27 d. On a right hand side window 27 e of the screen may be displayed a graph of the auditory event 28 having the relative directional angles plotted along the ordinate and the pitches (frequencies) plotted along the abscissa. Here, each auditory event is represented by a circle 27 f whose diameter indicates the firmness of belief in locating the sound source. This permits an auditory event 28 to be perceived at once by seeing the viewer 27 displaying the extracted peaks, the power spectrum and the circle above the graph.

[0144] As shown in FIG. 6, the vision module 30 comprises a camera 15 as a device layer; a face finder 31, a face identifier 32 and a face locator 33 as a process layer; a face ID 34 and a face direction 35 as a characterizing layer (data); and a visual event former 36 and a viewer 37 as an event layer. In this makeup of the vision module 30, the face finder 31 detects the face of each speaker, e.g., in terms of its skin color, on the basis of an image signal from the camera; the face identifier 32 looks the face up in a pre-registered face database 38 and, if an applicable face is found in the database 38, identifies the face by its ID 34; and the face locator 33 then determines the direction 35 in which the identified face lies. If the face identifier 32 fails to identify in the database 38 the face detected by the face finder 31, a face learner 38 a enters it in the database 38.

[0145] Here, if the face finder 31 detects a plurality of faces from the image signal, the vision module 30 performs the process mentioned above for each of the faces detected, namely identifies and locates each detected face and then tracks it. Since the detected faces tend, as often as not, to change in size, direction and brightness, the face finder 31 is here made capable of accurately detecting more than one face within a time period of 200 milliseconds by combined pattern matching on the basis of skin color extraction and correlative operations.

[0146] Further, the face identifier 32 projects each face regional image detected by the face finder 31 into a discriminant space and computes its distance d from the face data pre-registered in the face database 38. Since this distance d is dependent on the number of registered faces (L), it is converted to a firmness of belief $P_v$ that is independent of that parameter, by way of equation (5) given below:

$P_v = \int_{\frac{d^2}{2}}^{\infty} e^{-t}\, t^{\frac{L}{2}-1}\, dt$  (5)
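
The integral of equation (5), as reconstructed above from the garbled original, is the upper incomplete gamma function with parameter L/2 evaluated at d²/2; a minimal sketch using SciPy's regularized form Q(a, x):

    from scipy.special import gamma, gammaincc

    def face_belief(d, L):
        """Firmness of belief Pv of equation (5): the integral of
        e^(-t) * t^(L/2 - 1) from d^2/2 to infinity, expressed as
        Gamma(a) * Q(a, x) with a = L/2 and x = d^2/2."""
        a = L / 2.0
        return gamma(a) * gammaincc(a, d * d / 2.0)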

[0147] Here, since the discriminant matrix that forms the basis of the discriminant space can be renewed by a known online LDA (Linear Discriminant Analysis) with less computation than by the normal LDA, it is possible to register face data in real time.

[0148] The face locator 33 converts a face position in the two-dimensional image plane into a position in the three-dimensional space. Assuming that a face of w×w pixels is positioned at (x, y) in an image plane having a width X and a height Y, the face position in the three-dimensional space is found as a set of the directional angle θ, height φ and distance r given by equations (6), (7) and (8) as follows:

$r = \frac{C_1}{w}$  (6)

$\theta = \sin^{-1}\left(\frac{x - \frac{X}{2}}{C_2\, r}\right)$  (7)

$\varphi = \sin^{-1}\left(\frac{\frac{Y}{2} - y}{C_2\, r}\right)$  (8)

[0149] where $C_1$ and $C_2$ are constants defined by the size (X, Y) of the search image, the angle of view of the camera and the size of an actual face.
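
Equations (6) to (8) translate directly into a small routine; a minimal sketch, assuming the calibration constants C1 and C2 are supplied and the arguments keep the arcsine operands within [−1, 1]:

    from math import asin

    def face_position(x, y, w, X, Y, C1, C2):
        """Equations (6)-(8): map a w-by-w pixel face at (x, y) in an
        X-by-Y image to distance r, azimuth theta and elevation phi
        (angles in radians)."""
        r = C1 / w
        theta = asin((x - X / 2.0) / (C2 * r))
        phi = asin((Y / 2.0 - y) / (C2 * r))
        return r, theta, phi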

[0150] Then, for each of the faces, the vision module 30 at its visual event former 36 makes a visual event 39 from the face ID (name) 34 and the face direction 35. The visual event 39 specifically comprises the face ID (name) 34 and the position (distance r, horizontal angle θ and vertical angle φ), each with the five upper values of the firmness of belief.

[0151] The viewer 37 is included to display the visual event, and more specifically to display on the client's screen, as shown in FIG. 11(B), an image or picture 37 a by the camera 15, a list 37 b of the respective face IDs for the extracted faces and their positions, and a list 37 c of the respective face IDs for the extracted faces, their values of the firmness of belief and their distances.

[0152] Here, in the image or picture 37 a by the camera 15, a face detected and identified is displayed enclosed within a rectangular frame. As shown, more than one face may be detected and displayed enclosed within the rectangular frames 37 d (colored, e.g., in red) and 37 e (colored, e.g., in yellow), respectively, showing that they are each identified. Attendant thereon, a listing in the list 37 b as shown is displayed for each of the faces detected and identified. This permits the visual event 39 to be instantly perceived by vision from the extracted faces displayed enclosed within the frames 37 d and 37 e, the face location list 37 b and the list 37 c of the extracted face IDs with their values of the firmness of belief, namely by seeing the display of the viewer 37.

[0153] As shown in FIG. 7, the motor control module 40 comprises a motor 41 and a potentiometer 42 as a device layer; a PWM control circuit 43, an AD converter circuit 44 and a motor controller 45 as a process layer; a robot direction 46 as a characterizing layer; and a motor event former 47 and a viewer 48 as an event layer.

[0154] In the motor control module 40 so constructed, the motor controller 45, in response to commands from an attention control module 64 (to be described later), controls the driving of the motor 41 via the PWM control circuit 43; the potentiometer 42 detects a rotary position of the motor 41; the motor controller 45 via the AD converter circuit 44 extracts therefrom a robot direction 46; and the motor event former 47 makes a motor event 49 composed of the motor directional data.

[0155] The viewer 48 is used to display a motor event three-dimensionally on a client screen. More specifically, as shown in FIG. 12(A), use is made of a three-dimensional viewer implemented, e.g., with OpenGL, to display in real time an orientation and a motion speed of the robot in terms of a three-dimensionally represented figure of the robot 48 a, with the orientation and the length of an arrow 48 b, colored in red, representing the orientation and the speed, respectively. In this manner, the three-dimensionally displayed motor event 49 of the robot 10 can be instantly perceived by vision by seeing the display of the viewer 48. The visual point on the robot 10 can be varied as desired and can also be zoomed both in and out.

[0156] As shown in FIG. 8, the interactive module 50 comprises a speaker 51 and the microphone set 16 as a device layer, and a speech (voice) synthesis circuit 52, a dialog control circuit 53, an auto-voice holdback circuit 54 and a speech (voice) recognition circuit 55 as a process layer. In the interactive module 50 so constructed, the dialog control circuit 53 is controlled by the association module 60 to be described later, and the speaker 51 is driven by the speech synthesis circuit 52 so that a given speech may be emitted to the target or object speaker. After the auto-voice holdback circuit 54 eliminates the sound emitted from the speaker 51 from the sound signals received from the microphone 16, the speech recognition circuit 55 recognizes a speech from the target or object speaker. The interactive module 50, as shown, dispenses with the hierarchical characterizing and event layers.

[0157] Here, the robot, when serving as a receptionist for a party, gives the highest priority to continuing its current attention, but when serving as a companion for the party it is controlled to give its attention to the most recently associated stream.

[0158] As shown in FIG. 9, the association module 60 ranks hierarchically higher than the audition module 20, the vision module 30, the motor control module 40 and the interactive module 50, and constitutes a stream layer hierarchically higher than the event layers of those other modules 20, 30, 40 and 50.

[0159] Specifically, the association module 60 comprises: a synchronizer circuit 62 for synchronizing the asynchronous events 61 a from the audition module 20, the vision module 30 and the motor control module 40, namely the auditory event 28, the visual event 39 and the motor event 49, into synchronous events 61 b; a stream former 63 for associating these synchronous events 61 b with each other to make an auditory stream 65, a visual stream 66 and an association stream 67; an attention control module 64; and a viewer 68.

[0160] While the synchronizer circuit 62 synchronizes the auditory event 28 from the audition module 20, the visual event 39 from the vision module 30 and the motor event 49 from the motor control module 40 to form an auditory, a visual and a motor event synchronized with one another, the auditory event 28 and the visual event 39 have their coordinate systems converted by the synchronized motor event into an absolute coordinate system.

[0161] Here, the delay time it takes after each event is actually observed until the event reaches the association module 60 via the network 70 is, e.g., 40 milliseconds for the auditory event 28, 200 milliseconds for the visual event 39 and 100 milliseconds for the motor event 49, owing to a delay of 10 to 200 milliseconds occurring in the network and to their different arrival periods.

[0162] Thus, in order for them to be synchronized with each other, the auditory event 28, the visual event 39 and the motor event 49 from the audition module 20, the vision module 30 and the motor control module 40 are each provided with a time stamp indicating the time of observation, and each is once stored in a short-term memory circuit (not shown) for a time period of, e.g., 2 seconds.

[0163] The synchronizer circuit 62 is thus provided with a delay time of 500 milliseconds, set with the abovementioned delay times taken into account, and, upon comparison with the actual time of observation, takes out by synchronization processing each event stored in the short-term memory circuit. The synchronization processing is designed to occur in a period of, e.g., 100 milliseconds. Since the events may arrive at the association module 60 asynchronously with each other, an event does not necessarily exist that is concurrent with the time of observation set for synchronization. Accordingly, the synchronization processing here is designed to make a linear interpolation between the events taking place before and after the time of observation set for synchronization.
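
The interpolation step may be sketched as follows; the representation of the short-term memory as a list of (timestamp, direction) pairs is an assumption of the sketch, and angle wraparound is ignored for brevity:

    def interpolate_direction(events, t):
        """Linearly interpolate between the two time-stamped events
        straddling the synchronization instant t, per paragraph [0163].
        `events` is a list of (timestamp, direction) pairs held in the
        2-second short-term memory."""
        events = sorted(events)
        for (t0, d0), (t1, d1) in zip(events, events[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
                return d0 + a * (d1 - d0)
        return None  # no pair of events straddles t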

[0164] The stream former 63, as shown in FIG. 14, reads out an auditory event S and a visual event V from the short-term memory circuit M and makes the streams 65, 66 and 67 under the conditions stated below.

[0165] 1. An auditory event 28, as indicated by reference lettering Y1, is connected to the auditory stream 65 which is the closest to it, as having a pitch that is equal to its frequency or a harmonic thereof and a direction lying within a range of ±10°. The range of ±10° is selected by taking into account the accuracy of the auditory epipolar geometry.

[0166] 2. A visual event 39, as indicated by reference lettering Y2, is connected to the visual stream 66 which is the closest to it, as having a common face ID 34 and lying within a range of 40 cm. The range of 40 cm is selected under the assumption that a human object here will not move faster than 4 meters per second.

[0167] 3. If the search done over all the streams shows that there is an event for which no connectable stream 65 or 66 exists, that event 28, 39 comes to constitute a new stream 65, 66, as indicated by reference lettering Y3.

[0168] 4. An already existing stream 65, 66, in the absence of an event 28, 39 connectable thereto, is allowed to continue to exist for a maximum time period of 500 milliseconds, as indicated by reference lettering Y4 a; but if the state in which no event can be connected continues thereafter, it is allowed to cease existing, as indicated by reference lettering Y4 b.

[0169] 5. If the state in which an auditory stream 65 and a visual stream 66 come close to each other within ±10° continues for 500 milliseconds out of 1 second, the auditory stream 65 and the visual stream 66 are regarded as being derived from an identical object speaker and, as indicated by reference lettering Y5, are associated with each other to form an association stream 67.

[0170] 6. Unless the auditory event 28 or the visual event 39 continues for 3 seconds or more, the association stream 67 has its constituent streams dissociated, whereby only the existing auditory stream 65 or the existing visual stream 66 continues to exist.

[0171] 7. If the directional difference between the auditory stream 65 and the visual stream 66 making up the association stream 67 becomes ±30° or more, the association stream 67 has them dissociated from each other, thereby returning them to their individual auditory and visual streams 65 and 66.

[0172] Thus, making an auditory stream 65 and a visual stream 66 from an auditory and a visual event synchronized with each other by the synchronizer circuit 62, the stream former 63 is designed to take the temporal link of such events into account so as to connect the synchronized auditory and visual events to the auditory stream 65 and the visual stream 66 of an identical speaker, then, if such streams 65 and 66 are strongly linked with each other, to make an association stream 67 by associating them and, if the auditory and visual streams 65 and 66 constituting the association stream 67 become weakly linked with each other, to disassociate them. A condensed sketch of this stream-forming logic follows.
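
A minimal sketch of rules 1, 3 and 4 above for the auditory side; the dictionary representation of streams and events, and the harmonic tolerance, are assumptions of the sketch:

    def connect_auditory_event(event, streams, now):
        """Connect an auditory event to the nearest stream within +/-10
        degrees that shares its pitch or a harmonic of it (rule 1), start
        a new stream when none qualifies (rule 3), and drop streams left
        unconnected for more than 500 ms (rule 4)."""
        def is_harmonic(f, g):
            r = max(f, g) / min(f, g)
            return abs(r - round(r)) < 0.05  # assumed tolerance

        candidates = [s for s in streams
                      if abs(s["direction"] - event["direction"]) <= 10.0
                      and is_harmonic(s["pitch"], event["pitch"])]
        if candidates:
            best = min(candidates,
                       key=lambda s: abs(s["direction"] - event["direction"]))
            best.update(direction=event["direction"], pitch=event["pitch"],
                        last_update=now)
        else:
            streams.append(dict(event, last_update=now))  # rule 3: new stream
        streams[:] = [s for s in streams if now - s["last_update"] <= 0.5]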

[0173] Therefore, even if the object speaker is moving, it follows that its movement, if within the angular limits preset as mentioned above, can be predicted and tracked upon forming these streams 65, 66 and 67.

[0174] The attention control module 64 is designed to effect the attention control required to make a plan for the drive motor control of the motor control module 40. Reference is then made preferentially to the association stream 67, the auditory stream 65 and the visual stream 66, in this order, to effect the attention control.

[0175] The attention control module 64 makes a plan for the robot 10 to act according to how the auditory and visual streams 65 and 66 stand and whether an association stream is present, and, if the drive motor 41 needs to be driven, transmits a motor event in the form of action commands via the network 70 to the motor control module 40.

[0176] Here, what the attention control by the attention control module 64 gives is either to continue or to trigger: the former lets the robot continue tracking in its existing state, and the latter lets the robot attend to tracking the object that is then the most interesting.

[0177] Therefore, for the attention control:

[0178] 1. In the existence of an association stream, which shows that a person speaking while standing opposite the robot 10 either exists currently or existed in the immediate past, the attention must be directed at a high priority to such a person for tracking.

[0179] 2. Since the microphone 16 is non-directional and thus largely free from such detection limits as the visual field angle of a camera, higher priority should be given to the auditory stream than to the visual stream.

[0180] With these particulars taken into account, and also in accordance with the following general rules, a stream to which attention must be directed is selected and tracked; a minimal sketch of this selection follows the rules below.

[0181] 1. Top priority is given to an association stream for tracking.

[0182] 2. In the absence of an association stream, priority should be given to an auditory stream for tracking.

[0183] 3. In the absence of both association and auditory streams, priority should be given to the tracking of a visual stream.

[0184] 4. If a plurality of streams of the same type exists, the oldest one of them should be preferentially tracked.
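
The four rules above reduce to a simple prioritized selection; a minimal sketch, assuming each stream carries a `created` timestamp (an illustrative representation, not one given in the text):

    def select_attention(association, auditory, visual):
        """Attention selection per rules 1-4 above: association streams
        first, then auditory, then visual; ties within a pool are broken
        in favor of the oldest stream."""
        for pool in (association, auditory, visual):
            if pool:
                return min(pool, key=lambda s: s["created"])  # oldest first
        return None  # nothing to attend to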

[0185] The attention control module 64, by effecting the attention control in this manner, makes a plan for the drive control of the motor 41 in the motor control module 40, forms motor commands in accordance with the plan so made and transmits them to the motor control module 40 via the network 70. Then, in the motor control module 40, the motor controller 45, responding to the motor commands, performs PWM control to rotationally drive the motor 41 so as to direct the robot 10 towards the appropriate direction.

[0186] The viewer 68 is designed to display the thus formed streams on a screen of the server. More specifically, as shown in FIG. 12(B), they are displayed in a radar chart 68 a and a stream chart 68 b. The radar chart 68 a displays an instantaneous state of the association stream in terms of the visual field angle 68 a 1 of the camera, shown by, e.g., a first fan-shaped zone that is broad and bright (colored in pink as shown), and the direction 68 a 2 of the stream, shown by, e.g., a second fan-shaped zone that is narrow and dark. The stream direction 68 a 2 here is displayed, e.g., colored in red in the presence of both an auditory and a visual stream, in blue in the presence of only an auditory stream, and in green in the presence of only a visual stream.

[0187] The stream chart 68 b displays an association stream 68 b 1 indicated by a thick solid line and an auditory or visual stream 68 b 2 indicated by a thin solid line. The association stream 68 b 1 is displayed, e.g., colored in red in the presence of both an auditory and a visual stream, in blue in the presence of only an auditory stream, and in green in the presence of only a visual stream. The stream 68 b 2 indicated by the thin solid line is displayed, e.g., colored in blue if it is an auditory stream and in green if it is a visual stream.

[0188] Thus, seeing the display of the viewer 68 allows the instantaneous auditory and visual streams to be perceived at once by vision from the radar chart 68 a, and the temporal flows of the visual and auditory streams to be instantly perceived by vision from the stream chart 68 b. Also, the visual perception of the display colors that appear makes it easy to grasp which stream is being tracked in the attention control.

[0189] When the humanoid robot 10 in the illustrated form of implementation of the present invention, so constructed and configured as described above, serves as a receptionist robot for a party, it operates as described below with reference to FIG. 10.

[0190] First, the robot 10, as shown in FIG. 10(A), is disposed at the entrance of a party hall.

[0191] FIG. 10(B) shows that while a participant P is approaching the robot 10, the robot has not yet recognized the participant P. Here, if the participant P speaks to the robot 10 by saying, e.g., “Hello!”, the robot 10 has the microphone set 16 pick up the voice of the participant P, and the audition module 20 forms an auditory event 28 that identifies the direction of the voice (sound) source and transmits it via the network 70 to the association module 60. Then, the association module 60 forms an auditory stream 29 in response to the auditory event 28. At this point of time, the vision module 30 forms no visual event since the participant P has not entered the visual field of the camera 15. As a consequence, the association module 60 forms only the auditory stream 29 in response to the auditory event 28 and triggers it through the attention control module 64 to effect an attention control such as to bring the attention of the robot 10 to the participant P.

[0192] In this way, the so-called tracking by voice is effected, whereby the robot 10 is directed to turn its head towards the participant P as shown in FIG. 10(C). Next, the vision module 30 captures an image of the participant P's face with the camera 15 to form a visual event 39, searches the face database 38 for the participant P's face for its identification, and transmits the ID 34 of the identified face and its image to the association module 60 via the network 70. If the participant P's face has not been registered in the face database 38, the vision module 30 communicates this to the association module 60 via the network 70.

[0193] At this point of time, the robot 10 has an association stream 67 formed from the auditory event 28 and the visual event 39. Then, since the attention control module 64 in response to this association does not alter the attention control, the robot 10 keeps looking towards the participant P. Consequently, even if the participant P moves, the robot 10, with the motor control module 40 controlled by the association stream 67, tracks the participant P, whereby the camera 15 in the vision module 30 is allowed to continue imaging the participant P. On the other hand, the association module 60 gives an input to the speech recognition circuit 55, which in turn gives a result of the speech recognition to the dialog control circuit 53. The dialog control circuit 53 synthesizes a speech and causes it to be emitted through the speaker 51. The speech recognition circuit 55 then makes the robot 10 capable of recognizing a speech of the participant with increased precision by disregarding its own utterance, namely by subtracting, through the auto-voice holdback circuit 54, its own voice emitted from the speaker 51 from the sound signal from the microphone 16.

[0194] Here, the speech synthesized differs depending on whether or not the participant P's face has been registered in the face database 38. If the participant P's face has been registered in the face database 38, the association module 60, on the basis of the face ID 34 from the vision module 30, may control the dialog module 50 to cause the latter to synthesize a speech such that the robot 10 may ask the participant P the question: “Good afternoon, Mr. XXX?”

[0195] Then, if the participant P answers “Yes”, the dialog module 50, in response to the sound signal from the microphone 16, recognizes “Yes” by the speech recognition circuit 55 and causes the dialog control circuit 53 to synthesize a speech such that the robot may utter through the speaker 51: “Welcome to the party, Mr. (or Ms.) XXX. Come into the hall, please.”

[0196] If the participant P's face has not been registered in the database 38, the association module 60 acts on the dialog module 50 to cause it to synthesize a speech such that the robot may ask the participant P the question: “Good afternoon, may I have your name?”

[0197] Then, if the participant P answers “My name is XXX”, the dialog module 50, in response to the sound signal from the microphone 16, recognizes “XXX” through the speech recognition circuit 55 and causes the dialog control circuit 53 to synthesize a speech such that the robot may utter through the speaker 51: “Welcome to the party, Mr. (or Ms.) XXX. Come into the hall, please.”

[0198] While the robot 10 in this manner recognizes the participant P and guides his or her entrance into the party hall, the vision module 30 has the name and the face image of the participant P registered in the face database 38.

[0199] Should the humanoid robot 10 serve as a companion robot, it may operate as follows. Operating in response to an auditory event 28 from the audition module 20, a visual event 39 from the vision module 30 and an association stream 67 from the association module 60, the humanoid robot 10 recognizes a plurality of object speakers by both audition and vision, and is made capable also of tracking one of them selectively and of switching from one of them to another midway through tracking.

[0200] The robot 10 when serving as a companion robot plays a passive role; it only “listens and looks” to the participants in the party, and does not speak to them through the dialog module 50.

[0201] The humanoid robot 10 as a companion robot may operate as described below with reference to FIG. 15.

[0202] First, the humanoid robot 10, having no particular well-defined scenario, acts, e.g., to track one of the four speakers who appear on stage in FIG. 15 and to switch its attention to another of them midway. Its action can readily be grasped and assessed by visually recognizing the displays on the viewers 27, 37, 48 and 68.

[0203] Here, each of FIGS. 15(A) to 15(H) comprises a snapshot that appears in its upper left area, and the displays of the viewers 68, 27 and 37, which appear in its upper right, lower left and lower right areas, respectively.

[0204] First, as shown in FIG. 15(A), the humanoid robot 10 detects through its vision module 30 the face of the leftmost speaker, which the viewer 37 in the lower right hand side displays within its rectangular frame. Then, the visual event for that speaker (the first speaker) constitutes, in the stream chart 68 b in the upper right hand side, a visual stream indicated by the thin solid line 68 b 2.

[0205] Next, when the first speaker commences speaking, the audition module 20 detects an auditory event 28, which, as shown in FIG. 15(B), is displayed as a small circle in the right hand side window 27 e of the viewer 27 in the lower left hand side of the figure, while the set of peaks of harmonics extracted from its power spectrum is displayed in the left hand side window 27 a of the viewer. This auditory event 28 is converted in the association module 60 into an auditory stream, which is displayed as a narrow and dark fan 68 a 2 in the radar chart 68 a on the viewer 68 in the upper right hand side.

[0206] Then, with the auditory stream 65 and the visual stream 66 formed by the association module 60 from the auditory event 28 and the visual event 39 for the first speaker and having a common direction over a fixed time period, an association stream 67 is formed at the stream former 63 in the association module 60, which, as shown in FIG. 15(C), is displayed in the form of a thick solid line in the stream chart 68 b on the upper right hand side viewer 68. Attention control for the first speaker is now activated.

[0207] Thereafter, when the first speaker discontinues speaking as shown in FIG. 15(D), the abovementioned association stream is disassociated, the attention to the first speaker is defocused, and the attention control is interrupted.

[0208] Now, as the rightmost person begins to talk, attention is directed to this speaker. But while the humanoid robot 10 is thus turning to face the speaker, this person ceases talking. Then, the vision module 30 happening to detect the face of the second-from-left person, the humanoid robot stops turning horizontally.

[0209] Then, as shown in FIG. 15(E), the second-from-left person begins to talk, an association stream for this speaker is formed and attention is directed to him.

[0210] Thereafter, as shown in FIG. 15(F), that speaker stops talking, but his visual event, continually detected, allows the association stream 67 to remain in existence for several seconds.

[0211] And then, as shown in FIG. 15(G), the association stream 67 vanishes, and thereafter the third-from-left person commences talking. The humanoid robot 10 turns to face this person. With the vision module 30 unable to detect this speaker, however, neither a visual event 39 nor a visual stream 66 therefor is formed; hence no association stream is formed therefor.

[0212] Thereafter, as shown in FIG. 15(H), the leftmost (first) person begins talking again. Then, since this speaker has not been detected by the vision module 30, and thus neither a visual event 39 nor a visual stream 66 therefor is formed, and further no association stream therefor is formed, the humanoid robot 10, triggered by the auditory event 28, turns to face the first person.

[0213] In this way, the humanoid robot 10, on the basis of an auditory and a visual event 28 and 39 formed by the audition and vision modules 20 and 30, respectively, and an association stream 67 formed by the association module 60, is capable of auditorily and visually recognizing a plurality of speakers, selectively tracking one of them, and switching midway from one speaker to another for tracking.

[0214] The humanoid robot 10 when serving as a companion robot may operate as described below with reference to FIGS. 16 and 17.

[0215] For the humanoid robot 10 having no particular well-defined scenario, use may be made of a scenario, e.g., as shown in FIG. 16, as a benchmark to assess the present system. FIGS. 17(A), 17(B) and 17(C) show the facing direction of the robot, the tracking by a visual stream and the tracking by an auditory stream, respectively.

[0216] In this scenario, two speakers A and B take a variety of actions over a time period of about 40 seconds. The scenario is specifically as follows:

[0217] At a time instant t1, Mr. A enters the visual field of the robot 10. The vision module 30 detects the face of Mr. A to form a visual event for him and to cause a visual stream for him to be formed by the association module 60.

[0218] At a time instant t2, Mr. A begins to talk to the robot 10. The audition module 20 detects the voice of Mr. A to form an auditory event for him and to cause the association module 60 to form an auditory stream 65 and an association stream 67 for him. The visuoauditory servo of the robot is thereby actuated.

[0219] At a time instant t3, Mr. B begins to talk outside of the visual field of the robot 10. The audition module 20 detects the voice of the (invisible) Mr. B to form an auditory event for him and to cause an auditory stream for him to be formed by the association module 60.

[0220] At a time instant t4, Mr. A, upon moving, hides himself in the shade. This causes the vision module 30 to cease forming the visual event for Mr. A and causes the visual stream for him to break off. The association stream remains in existence for a given time period.

[0221] At a time instant t5, Mr. A, coming out of the shade, appears again. Then, the vision module 30 forms a visual event for Mr. A again and causes an association stream 67 for him to be formed again by the association module 60.

[0222] Thereafter, Mr. A stops talking and again hides himself in the shade. Then, since the vision module 30 ceases forming the visual event for Mr. A and the visual stream for him breaks off, the association stream 67 for him is disassociated and vanishes after a given time period.

[0223] At a time instant t7, the auditory stream is triggered to cause the robot 10 to turn towards Mr. B, who is talking.

[0224] At a time instant t8, the robot 10 captures Mr. B in its visual field. The vision module 30 forms a visual event for Mr. B and causes a visual stream and an association stream 67 to be formed by the association module 60.

[0225] At a time instant t9, Mr. A, while talking, enters the visual field of the robot 10. The audition and vision modules 20 and 30 form an auditory and a visual event for Mr. A and cause an auditory and a visual stream to be formed by the association module 60.

[0226] At a time instant t10, Mr. B stops talking. The audition module 20 ceases forming the auditory event for Mr. B, and the association module 60 disassociates the association stream for Mr. B, extinguishing the auditory stream for him while leaving only the visual stream for him. The attention of the robot 10 is then directed to Mr. A, and an association stream for him is likewise formed.

[0227] With the abovementioned scenario followed, it can be seen that the robot visuoauditory system according to the present invention has the following features:

[0228] 1. When a new association stream is formed at the time instants t1 and t6, the attention in the attention control module 64 is directed to the new association.

[0229] 2. Even at a time instant such as t4 or t5, when the visual stream in the association stream is broken off by Mr. A disappearing, the continuation of the association stream allows the tracking of Mr. A by the auditory stream to be continued.

[0230] 3. When the association stream vanishes at the time instants t6 and t11, the attention control is performed by the auditory stream, which has the highest priority next to the association stream, to continue the tracking of the speaker, so that the robot 10 is made to face opposite the speaker as the object to be tracked and detects without fail the voice of the speaker arriving from the front, as shown in FIG. 16.

[0231] 4. At the time instant t9 and thereafter, the two speakers, Messrs. A and B, lie close to each other, so that they fall within the visual field of the camera 15 (with a difference in direction of about 20 degrees). Even then, the auditory, visual and association streams for one of them are formed separately from those for the other, to ensure that the individual speakers are tracked accurately and without fail.

[0232] Thus, the humanoid robot 10 is provided with its visuoauditory servo wherein, if an auditory and a visual event are formed, they are associated with each other to form an association stream on the basis of which attention control is effected. Compared with the conventional auditory or visual servo, the visuoauditory servo of the invention, using both audition and vision, makes it possible to track a speaker much more accurately.

[0233] Also, if the tracking by the visual stream is broken off as shown in FIG. 17(B) (showing a first candidate only of the visual event) as the result of the speaker hiding in the shade or moving to the outside of the visual field and thus disappearing, the association stream 67 maintained by the auditory stream, as shown in FIG. 17(C), allows the speaker to be tracked without fail, thereby permitting a plurality of speakers to be recognized by vision and audition, one of them to be tracked, and a switch to be made midway from one of them to another for tracking.

[0234] From FIG. 17(B) it is seen that in the time period t4-t5, in which the visual stream is broken off, and in the time period t6-t7, in which the speaker lies outside of the robot's visual field, the speaker cannot be tracked by the visual stream. In these time periods, however, the speaker can be tracked without fail by reference to the auditory stream, as shown in FIG. 17(C).

[0235] In FIG. 17(C) it is also seen that in the time period from t3 to around 23 seconds and in the time period from around 34 seconds to t10 the auditory streams of Messrs. A and B are separated from each other, but in the time periods around t8 and around t6 an erroneous auditory stream is formed. Also, in the time period from t5 to around 17 seconds, the concurrent occurrence of Mr. A's movement and the horizontal rotation of the robot 10 causes the quality of the sound signal from the sound source to be reduced, owing to the speaker's movement and to the motor noises and their echoes, so that the two speakers cannot be located very accurately. Even in such a case, referring to the visual stream as shown in FIG. 17(B) allows a speaker to be tracked without fail.

[0236] Thus, performing the visuoauditory servo of a robot using both audition and vision, whereby an auditory and a visual stream are associated with each other to form an association stream, allows the ambiguities which the auditory and visual streams individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived and tracked without fail.

[0237] Further, the humanoid robot 10 when serving as a party companion robot may own the face database 38 in common with a party receptionist robot, or may have the face database 38 of the party receptionist robot transmitted or copied thereto. This allows the humanoid robot 10 as a party companion robot to recognize all the party participants at all times by identifying their faces.

[0238] Thus, the humanoid robot 10 according to the present invention implemented as illustrated, wherein on the basis of an auditory and a visual event from the audition and vision modules 20 and 30 the association module 60 forms an auditory and a visual stream and an association stream for each of a plurality of object speakers to recognize them, is capable of auditorily and/or visually tracking these speakers in real time even if any of the events is missing or becomes impossible to recognize distinctly: e.g., if a speaker moves and becomes “invisible”, then by audition, and if a speaker ceases talking and becomes “inaudible”, then by vision.

[0239] Also, the humanoid robot 10 according to the present invention implemented as illustrated, wherein on the basis of an auditory and a visual event from the audition and vision modules 20 and 30 the association module 60 forms respective auditory and visual streams and an association stream for each of a plurality of object speakers, taking into account the temporal flows of their respective directional data and individual speaker identifications so as to recognize these individual speakers, is capable of auditorily and/or visually tracking these speakers in real time even if any of the events is missing or becomes impossible to recognize distinctly: e.g., if a speaker moves and becomes “invisible”, then by audition, and if a speaker ceases talking and becomes “inaudible”, then by vision.

[0240] Also, the humanoid robot 10 according to the present invention implemented as illustrated, wherein on the basis of an auditory and a visual event from the audition and vision modules 20 and 30 the association module 60 forms an auditory and a visual stream and an association stream for each of a plurality of object speakers to visuoauditorily recognize them, is capable of tracking these speakers more accurately than with a servo by either audition or vision alone, and is capable of auditorily and/or visually tracking these speakers in real time even if any of the events is missing or becomes impossible to recognize distinctly: e.g., if a speaker moves and becomes “invisible”, then by audition, and if a speaker ceases talking and becomes “inaudible”, then by vision.

[0241] The modules, having the auditory information, visual information, motor status information and stream information displayed on their respective viewers, allow all the information to be perceived outright by vision and the status of operation of the humanoid robot 10 to be readily assessed. In this case, the colors used for display in the viewers are made different for the different statuses of an association stream but are unified between the different viewers, thus permitting each status of the association stream to be readily grasped.

[0242] While in the form of implementation described above the humanoid robot 10 is shown constructed as having four DOFs (degrees of freedom) in its action, this is not intended to be a limitation, and the robot visuoauditory system according to the present invention may be incorporated into a robot constructed to operate in any way desired.

[0243] Also, while the form of implementation of the invention in which the robot visuoauditory system is incorporated into the humanoid robot is described above, this is not intended to be a limitation, and the system may be incorporated in any form of robot, such as an animal type robot, e.g., made to resemble a dog, a plant type robot, e.g., made to resemble a flower, or any other type of robot.

[0244] Further, while the form of implementation of the present invention is described above wherein, if a plurality of streams of the same type exists, the oldest stream is preferentially tracked, this is not intended to be a limitation, and another stream, e.g., the newest stream, may preferentially be tracked.

[0245] According to the present invention described hereinbefore, the audition module receives sounds collected by the microphones from external objects as sound sources, extracts pitches from the collected sounds utilizing their harmonic structures to find the directions in which the sound sources respectively exist, identifies the individual speakers as the respective object sound sources and then extracts their respective auditory events. Also, the vision module receives images of the speakers taken by a camera and extracts from the images the respective visual events of the individual speakers upon identifying the faces of the speakers by pattern recognition and locating the speakers. Further, the motor control module extracts a motor event upon detecting the direction in which the robot faces on the basis of a rotary position of the drive motor turning the robot horizontally.

[0246] Here, the association module forms, from the auditory events, the visual events and the motor event thus extracted, an auditory stream and a visual stream for each of the speakers, associates these streams with each other to form an association stream for each of the speakers, and then performs attention control on the basis of these streams to make a plan of the course in which the drive motor is to be controlled in the motor control module.

[0247] And the attention control module directs the robot to face towards an object speaker by controlling the drive motor in the motor control module on the basis of the plan so made. Thus, directing the robot to face opposite the object speaker allows the audition module to collect through the microphones a voice of the object speaker from the front direction, in which the voice can be collected at the highest sensitivity, and thereby to locate the object speaker accurately, while at the same time permitting the vision module to take through the camera a clear-cut image of that object speaker.

[0248] Accordingly, the audition module, the vision module and the motor control module, made to cooperate with the association module and the attention control module, allow the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived.

[0249] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be tracked through attention control on the basis of the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled in real time.

[0250] Further, dealing with events and streams, which are each a symbolic set of information, enables the data to be processed at a higher level and serves to reduce the amount of data to be transmitted via the network. Hence, the traffic entailed in the data transmission is lightened, which in turn contributes to the real-time processing in the entire visuoauditory system and also in the viewers.

[0251] Thus, there is provided in accordance with the present invention a highly eminent robot visuoauditory system that enables an object or objects to be tracked by both vision and audition and processed in real time.

[0252] Here, an association module is provided which, on the basis of an auditory event, a visual event and a motor event thus far extracted, forms an auditory stream and a visual stream for each of the speakers with reference to its directional information and upon identifying the speaker, and associates these streams with each other to form an association stream. And an attention control module is provided which on the basis of these streams effects attention control, thereby making a plan of the course in which to control the drive motor in the motor control module. The association module, upon locating the sound source of the auditory event and locating the face of the visual event, that is to say on the basis of the directional information from each of audition and vision, determines the direction in which each speaker is present and forms an auditory and a visual stream and an association stream therefor.

[0253] Also, the attention control module, on the basis of the plan so made, acts to control the drive motor in the motor control module and to direct the robot to face towards an object speaker. Thus, directing the robot to face opposite the object speaker allows the audition module to collect through the microphones a voice of the object speaker from the front direction, in which the voice can be collected at the highest sensitivity, and thereby to locate the object speaker accurately, while at the same time permitting the vision module to take through the camera a clear-cut image of that object speaker.

[0254] Accordingly, the audition module, the vision module and the motor control module are made to cooperate with the association module and the attention control module so as to allow tracking of the speaker with reference to the directional information for each of the auditory and visual events of the speaker identified, as well as to their temporal flows, and thus to allow the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived without fail.

[0255] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be perceived through attention control on the basis of the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled upon grasping the direction of the object speaker accurately.

[0256] Also, according to the present invention, making the audition module, the vision module and the motor control module cooperate with the association module and the attention control module allows the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived.

[0257] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be perceived by the attention control module on the basis of only the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled in real time.

[0258] Further, displays are included to display at least a portion of the auditory information from the audition module, the visual information from the vision module, the motor information from the motor control module and the stream information from the association module, and to visualize the real-time processing by the association module, thereby permitting the status of the real-time processing to be instantly perceived by vision.

[0259] Also, according to the present invention, the audition module, the vision module and the motor control module are made to cooperate with the association module and the attention control module to allow both audition and vision to be used, while permitting the ambiguities which the audition and vision of the robot individually possess to complement each other, thereby raising the so-called robustness of the system. Moreover, more than one speaker can thereby be individually perceived without fail.

[0260] Also, even when either its auditory or its visual event is missing or broken off, an object speaker can be perceived through attention control on the basis of only the visual or auditory event that remains extant, thereby permitting the motor control module to be controlled in real time.

[0261] Further, by tracking a speaker using either or both of the auditory and visual streams depending on the particular circumstances encountered, it is possible to track the speaker at all times with greater accuracy. Also, if more than one auditory stream and more than one visual stream are present, suitably using one or two of them depending on the particular circumstances encountered allows the individual speakers to be tracked with greater accuracy.

[0262] There is thus provided in accordance with the present invention a highly eminent robot visuoauditory system whereby the visual and auditory information for each of the objects are unified to track the objects without fail.

INDUSTRIAL APPLICABILITY

[0263] A robot visuoauditory system according to the present invention is applicable not only to a party receptionist and/or companion robot but to almost the entire field of industry. For example, it is thought to be applicable to various receptionist robots and monitoring robots.

What is claimed is:
 1. A visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from said audition module, said vision module and said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by said association module, whereby: said audition module in response to sound signals from said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event; said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker, thereby extracting a visual event therefor; said motor control module in response to a rotary position of said drive motor is adapted to extract a motor event therefrom; said association module on the basis of the auditory event from said audition module, the visual event from said vision module and the motor event from said motor control module is adapted to form an auditory stream and a visual stream and then associate these streams with each other to form an association stream; and said attention control module in effecting the attention control is adapted to make a plan for a course in which to control said drive motor in said motor control module in accordance with said streams.
 2. A robot visuoauditory system as set forth in claim 1, characterized in that said association module when forming said auditory and visual streams is adapted to synchronize said auditory, visual and motor events, which are formed asynchronously with each other.
 3. A robot visuoauditory system as set forth in claim 1 or claim 2, characterized in that said association module is made a server and each of said audition, vision and motor control modules is made a client connected to said server.
 4. A robot visuoauditory system as set forth in claim 3, characterized in that said server and each said client are interconnected via a LAN.
 5. A robot visuoauditory system as set forth in any one of claims 1 to 4, characterized in that when said attention control module effects the attention control, higher priority is given in the order of the association stream, the auditory stream and the visual stream.
 6. A robot visuoauditory system as set forth in any one of claims 1 to 4, characterized in that said association module when forming said auditory and visual streams is adapted to synchronize said auditory, visual and motor events with each other.
 7. A visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from said audition module, said vision module and said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by said association module, whereby: said audition module in response to sound signals from said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate the sound sources so as to identify a sound source as at least one speaker, thereby extracting an auditory event; said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker, and then to identify it as the sound source, thereby extracting a visual event therefor; said motor control module in response to a rotary position of said drive motor is adapted to extract a motor event therefrom; said association module on the basis of said auditory event, said visual event and said motor event is adapted to form an auditory stream and a visual stream upon determining a direction in which each such speaker lies on the basis of the directional information derived from locating the sound source of said auditory event and that derived from locating the face of said visual event, and then to associate these streams with each other to form an association stream; and said attention control module in effecting the attention control is adapted to make a plan for a course in which to control said drive motor in said motor control module in accordance with said streams.
 8. A robot visuoauditory system as set forth in claim 7, characterized in that said association module when forming said auditory and visual streams is adapted to synchronize said auditory, visual and motor events, which are formed asynchronously with each other.
 9. A robot visuoauditory system as set forth in claim 7 or 8, characterized in that: said audition module is adapted to identify each such speaker upon detecting a voice MFCC from the sound signals; and said association module is adapted to select said auditory stream and said visual stream to which said auditory event and said visual event are to be connected, by specifying each such speaker on the basis of the speaker identification for said auditory event and the speaker identification for said visual event.
10. A robot visuoauditory system as set forth in any one of claims 7 to 9, characterized in that with a plurality of streams coming close to one another, said association module is adapted to select said auditory stream and said visual stream to which said auditory event and said visual event are to be connected, by referring to temporal flows of said auditory and visual events.
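When several streams come close to one another, as in claim 10, an event can be connected by extrapolating each stream's recent directional history rather than by its last position alone. A sketch under that assumption; the linear extrapolation is an illustrative choice, not the disclosed technique.

    def predicted_direction(history, dt=1.0):
        """Linearly extrapolate a stream's direction from its last two
        samples; with a single sample, predict no motion."""
        if len(history) < 2:
            return history[-1]
        velocity = history[-1] - history[-2]
        return history[-1] + velocity * dt

    def connect_event(event_dir, streams):
        """Attach the event to the stream whose temporal flow best
        predicts the event's direction."""
        return min(streams,
                   key=lambda s: abs(predicted_direction(s["history"]) - event_dir))

    # Two streams crossing near 20 degrees; their flows disambiguate them
    streams = [{"name": "A", "history": [10.0, 15.0, 20.0]},   # moving +5/step
               {"name": "B", "history": [30.0, 25.0, 20.0]}]   # moving -5/step
    print(connect_event(25.0, streams)["name"])   # -> "A"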
11. A robot visuoauditory system as set forth in any one of claims 7 to 10, characterized in that said association module is adapted to associate said auditory stream and said visual stream with each other to form said association stream if they are strongly associated with each other, and to disassociate and extinguish said association stream when said auditory and visual streams which make up the same become weakly associated with each other.
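The forming and extinguishing of the association stream in claim 11 can be pictured as hysteresis on an association strength; the strength measure and both thresholds below are assumptions for illustration only.

    FORM_THRESHOLD = 0.7       # assumed: strongly associated above this
    BREAK_THRESHOLD = 0.3      # assumed: weakly associated below this

    def association_strength(aud_dir, vis_dir, max_err=30.0):
        """Toy strength measure: 1.0 when directions coincide, falling
        to 0.0 as they diverge by max_err degrees."""
        return max(0.0, 1.0 - abs(aud_dir - vis_dir) / max_err)

    def update_association(associated, aud_dir, vis_dir):
        """Hysteresis: form the association stream when strength is high,
        extinguish it only once strength has become low."""
        s = association_strength(aud_dir, vis_dir)
        if not associated and s >= FORM_THRESHOLD:
            return True
        if associated and s <= BREAK_THRESHOLD:
            return False
        return associated

    state = False
    for aud, vis in [(10, 12), (10, 20), (10, 45)]:
        state = update_association(state, aud, vis)
        print(state)   # True, True (hysteresis holds), False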
12. A visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from said audition module, said vision module and said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by said association module, whereby: said audition module in response to sound signals from said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker, thereby extracting an auditory event; said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker, thereby extracting a visual event therefor; said motor control module in response to a rotary position of said drive motor is adapted to extract a motor event therefrom; said association module on the basis of the auditory event, the visual event and the motor event is adapted to form an auditory stream and a visual stream and then associate these streams with each other to form an association stream; and said attention control module in effecting the attention control is adapted to make a plan for a course in which to control said drive motor in said motor control module in accordance with said streams, wherein said system further comprises a display for displaying at least a portion of auditory information for said audition module, visual information for said vision module, motor information for said motor control module and stream information for said association module.
13. A robot visuoauditory system as set forth in claim 12, characterized in that said display includes an auditory display for displaying, as said auditory information, a spectrum of sound signals from sound sources and peaks extracted therefrom, and said auditory event.
14. A robot visuoauditory system as set forth in claim 13, characterized in that said auditory display is adapted to display said auditory event in the form of a circle with its center representing the robot, its axis of ordinates representing relative angles of the direction that the robot may face, its axis of abscissas representing pitches and its diameter representing a belief factor.
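One way to render the circular display of claim 14, assuming matplotlib and mapping each auditory event's direction to polar angle, its pitch to radius and its belief factor to marker size; this mapping is an illustrative reading of the claim, not the disclosed implementation.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical auditory events: (relative angle deg, pitch Hz, belief 0..1)
    events = [(-30.0, 180.0, 0.9), (15.0, 240.0, 0.5), (60.0, 120.0, 0.7)]

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.set_theta_zero_location("N")     # robot's facing direction is up
    for angle, pitch, belief in events:
        # angle around the robot, pitch as distance from the center,
        # belief factor as the diameter of the drawn circle
        ax.scatter(np.radians(angle), pitch, s=(belief * 40) ** 2, alpha=0.5)
    ax.set_title("Auditory events around the robot (center = robot)")
    plt.show()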
15. A robot visuoauditory system as set forth in any one of claims 12 to 14, characterized in that said display includes a visual display for displaying, as said visual information, an image of an extracted face taken by the camera and shown within a frame, and said visual event.
16. A robot visuoauditory system as set forth in claim 15, characterized in that said visual display is adapted to display said visual event in the form of a list of face identifications and face locations extracted with degrees of firmness.
17. A robot visuoauditory system as set forth in any one of claims 12 to 16, characterized in that said display includes a motor display for displaying, as said motor information, the direction in which and the speed at which the robot moves, three-dimensionally in real time.
18. A robot visuoauditory system as set forth in claim 17, characterized in that said display includes a stream display for displaying, as said stream information, a stream chart and a radar chart.
19. A robot visuoauditory system as set forth in claim 18, characterized in that said stream display is adapted to display in said stream chart said auditory, visual and association streams constituting said stream information.
20. A robot visuoauditory system as set forth in claim 18, characterized in that said stream display is adapted to display in said radar chart an instantaneous status of said streams, with a visual field of the camera and upon sound source location.
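The radar chart of claim 20 can likewise be sketched with matplotlib by shading the camera's visual field and marking each stream's current direction; the field-of-view angle and the stream data are assumed values.

    import numpy as np
    import matplotlib.pyplot as plt

    FOV = 60.0   # assumed horizontal field of view of the camera, degrees
    streams = {"auditory": -45.0, "visual": 10.0, "association": 8.0}

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.set_theta_zero_location("N")
    # shade the camera's visual field as a wedge centered on the facing
    ax.bar(x=0.0, height=1.0, width=np.radians(FOV), alpha=0.2, color="gray")
    for name, direction in streams.items():
        # draw each stream as a radial line at its current direction
        ax.plot([np.radians(direction)] * 2, [0.0, 1.0], label=name)
    ax.set_yticklabels([])
    ax.legend(loc="lower left")
    ax.set_title("Radar chart: instantaneous stream status")
    plt.show()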
21. A visuoauditory system for a robot, characterized in that it comprises: an audition module including at least a pair of microphones for collecting external sounds; a vision module including a camera for taking an image in front of the robot; a motor control module including a drive motor for turning the robot horizontally; an association module for combining events from said audition module, said vision module and said motor control module to produce streams; and an attention control module for effecting attention control on the basis of the streams produced by said association module, whereby: said audition module in response to sound signals from said microphones is adapted to extract pitches therefrom, separate their sound sources from each other and locate sound sources such as to identify a sound source as at least one speaker, thereby extracting an auditory event; said vision module on the basis of an image taken by the camera is adapted to identify by face, and locate, each such speaker, thereby extracting a visual event therefor; said motor control module in response to a rotary position of said drive motor is adapted to extract a motor event therefrom; said association module on the basis of the auditory event, the visual event and the motor event is adapted to form an auditory stream and a visual stream and then associate these streams with each other to form an association stream; and said attention control module in effecting the attention control is adapted to make a plan for a course in which to control said drive motor in said motor control module in accordance with said streams, thereby effecting a visuoauditory servo for the robot.
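The visuoauditory servo of claim 21 closes the loop from the selected stream back to the drive motor. A minimal proportional controller illustrates the idea; the gain, the per-cycle limit and the stream interface are assumptions, not the disclosed control law.

    KP = 0.5             # assumed proportional gain
    MAX_STEP = 10.0      # assumed per-cycle turn limit, degrees

    def servo_step(robot_angle, stream_direction):
        """One control cycle: turn the robot part of the way toward the
        direction held by the stream driving the attention control."""
        error = stream_direction - robot_angle
        step = max(-MAX_STEP, min(MAX_STEP, KP * error))
        return robot_angle + step

    angle, target = 0.0, 40.0
    for _ in range(6):
        angle = servo_step(angle, target)
        print(round(angle, 1))   # converges toward 40 degrees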
22. A robot visuoauditory system as set forth in claim 21, characterized in that when said attention control module effects the attention control, higher priority is given in the order of the association stream, the auditory stream and the visual stream.
23. A robot visuoauditory system as set forth in claim 21 or claim 22, characterized in that said attention control module in the presence of more than one auditory or visual stream is adapted to select from them one auditory or visual stream according to circumstances, and to form an association stream, thereby effecting the attention control on the basis of such an auditory, visual or association stream.
24. A robot visuoauditory system as set forth in any one of claims 1 to 23, characterized in that said robot is a humanoid robot.

25. A robot visuoauditory system as set forth in any one of claims 1 to 23, characterized in that said robot is an animal type robot.

26. A robot visuoauditory system as set forth in any one of claims 1 to 23, characterized in that said robot is a plant type robot.